INN Hotels Project¶

Context¶

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Canceling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is less desirable, and possibly revenue-diminishing, for hotels. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

  • Loss of resources (revenue) when the hotel cannot resell the room.
  • Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  • Lowering prices at the last minute so the hotel can resell a room, which reduces the profit margin.
  • Human resources to make arrangements for the guests.

Objective¶

The increasing number of cancellations calls for a machine learning based solution that can help predict which bookings are likely to be canceled. INN Hotels Group has a chain of hotels in Portugal. They are facing a high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description¶

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

  • Booking_ID: unique identifier of each booking
  • no_of_adults: Number of adults
  • no_of_children: Number of Children
  • no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
  • no_of_week_nights: Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel
  • type_of_meal_plan: Type of meal plan booked by the customer:
    • Not Selected – No meal plan selected
    • Meal Plan 1 – Breakfast
    • Meal Plan 2 – Half board (breakfast and one other meal)
    • Meal Plan 3 – Full board (breakfast, lunch, and dinner)
  • required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
  • room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels.
  • lead_time: Number of days between the date of booking and the arrival date
  • arrival_year: Year of arrival date
  • arrival_month: Month of arrival date
  • arrival_date: Date of the month
  • market_segment_type: Market segment designation.
  • repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
  • no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
  • no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
  • avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
  • no_of_special_requests: Total number of special requests made by the customer (e.g. high floor, view from the room, etc)
  • booking_status: Flag indicating if the booking was canceled or not.

Importing necessary libraries and data¶

In [63]:
# Installing the libraries with the specified version.
!pip install pandas==1.5.3 numpy==1.25.2 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 statsmodels==0.14.1 -q --user

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [64]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# setting the precision of floating numbers to 5 decimal points
pd.set_option("display.float_format", lambda x: "%.5f" % x)

# Library to split data
from sklearn.model_selection import train_test_split

# To build model for prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models
from sklearn.model_selection import GridSearchCV


# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

import warnings
warnings.filterwarnings("ignore")

from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter("ignore", ConvergenceWarning)
In [66]:
# Upload the dataset from my computer
from google.colab import files
hotel = files.upload()
Saving INNHotelsGroup.csv to INNHotelsGroup (1).csv
In [67]:
# files.upload() returns a dict of {filename: file bytes}; keep a copy of that dict.
# The DataFrame itself is created by pd.read_csv in the next cell.
data = hotel.copy()
In [68]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('INNHotelsGroup.csv')

Data Overview¶

  • Observations
  • Sanity checks
In [69]:
# Display the first 5 rows
print("First 5 rows of the DataFrame:")
display(df.head())

# Print the shape of the DataFrame
print("\nShape of the DataFrame:")
print(df.shape)

# Display data types of each column
print("\nData types of each column:")
display(df.info())

# Generate descriptive statistics for numerical columns
print("\nDescriptive statistics for numerical columns:")
display(df.describe())

# Generate descriptive statistics for non-numerical columns
print("\nDescriptive statistics for non-numerical columns:")
display(df.describe(include='object'))

# Check for missing values
print("Missing values per column:")
display(df.isnull().sum())
First 5 rows of the DataFrame:
Booking_ID no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status
0 INN00001 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled
1 INN00002 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled
2 INN00003 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled
3 INN00004 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled
4 INN00005 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled
Shape of the DataFrame:
(36275, 19)

Data types of each column:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   Booking_ID                            36275 non-null  object 
 1   no_of_adults                          36275 non-null  int64  
 2   no_of_children                        36275 non-null  int64  
 3   no_of_weekend_nights                  36275 non-null  int64  
 4   no_of_week_nights                     36275 non-null  int64  
 5   type_of_meal_plan                     36275 non-null  object 
 6   required_car_parking_space            36275 non-null  int64  
 7   room_type_reserved                    36275 non-null  object 
 8   lead_time                             36275 non-null  int64  
 9   arrival_year                          36275 non-null  int64  
 10  arrival_month                         36275 non-null  int64  
 11  arrival_date                          36275 non-null  int64  
 12  market_segment_type                   36275 non-null  object 
 13  repeated_guest                        36275 non-null  int64  
 14  no_of_previous_cancellations          36275 non-null  int64  
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64  
 18  booking_status                        36275 non-null  object 
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
None
Descriptive statistics for numerical columns:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month arrival_date repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests
count 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000 36275.00000
mean 1.84496 0.10528 0.81072 2.20430 0.03099 85.23256 2017.82043 7.42365 15.59700 0.02564 0.02335 0.15341 103.42354 0.61966
std 0.51871 0.40265 0.87064 1.41090 0.17328 85.93082 0.38384 3.06989 8.74045 0.15805 0.36833 1.75417 35.08942 0.78624
min 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 2017.00000 1.00000 1.00000 0.00000 0.00000 0.00000 0.00000 0.00000
25% 2.00000 0.00000 0.00000 1.00000 0.00000 17.00000 2018.00000 5.00000 8.00000 0.00000 0.00000 0.00000 80.30000 0.00000
50% 2.00000 0.00000 1.00000 2.00000 0.00000 57.00000 2018.00000 8.00000 16.00000 0.00000 0.00000 0.00000 99.45000 0.00000
75% 2.00000 0.00000 2.00000 3.00000 0.00000 126.00000 2018.00000 10.00000 23.00000 0.00000 0.00000 0.00000 120.00000 1.00000
max 4.00000 10.00000 7.00000 17.00000 1.00000 443.00000 2018.00000 12.00000 31.00000 1.00000 13.00000 58.00000 540.00000 5.00000
Descriptive statistics for non-numerical columns:
       Booking_ID type_of_meal_plan room_type_reserved market_segment_type booking_status
count       36275             36275              36275               36275          36275
unique      36275                 4                  7                   5              2
top      INN36275       Meal Plan 1        Room_Type 1              Online   Not_Canceled
freq            1             27835              28130               23214          24390
Missing values per column:
0
Booking_ID 0
no_of_adults 0
no_of_children 0
no_of_weekend_nights 0
no_of_week_nights 0
type_of_meal_plan 0
required_car_parking_space 0
room_type_reserved 0
lead_time 0
arrival_year 0
arrival_month 0
arrival_date 0
market_segment_type 0
repeated_guest 0
no_of_previous_cancellations 0
no_of_previous_bookings_not_canceled 0
avg_price_per_room 0
no_of_special_requests 0
booking_status 0

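  • The dataset looks clean: 36,275 bookings, 19 columns, no missing values, and booking_status is categorical (Canceled / Not_Canceled).
  • Five columns are of object type (Booking_ID, type_of_meal_plan, room_type_reserved, market_segment_type, booking_status); the remaining 14 are numeric (13 int64 and 1 float64).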
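The descriptive statistics above report a minimum of 0 for both no_of_adults and avg_price_per_room, so "no missing values" does not by itself mean every row is plausible. The sketch below shows one way such rows could be counted. The helper name `sanity_report` and the toy rows are hypothetical and are not part of the original notebook; the column names are the ones from the data dictionary.

```python
import pandas as pd


def sanity_report(bookings: pd.DataFrame) -> dict:
    """Count rows that may deserve a closer look before modeling."""
    return {
        # Booking_ID is described as a unique identifier, so duplicates are suspicious
        "duplicate_booking_ids": int(bookings["Booking_ID"].duplicated().sum()),
        # A booking with no adults is unusual
        "zero_adults": int((bookings["no_of_adults"] == 0).sum()),
        # A price of 0 may be a free room or a data-entry issue
        "zero_price": int((bookings["avg_price_per_room"] == 0).sum()),
    }


# Hypothetical toy rows, for illustration only (not taken from the dataset)
toy = pd.DataFrame(
    {
        "Booking_ID": ["A1", "A2", "A3", "A3"],
        "no_of_adults": [2, 0, 1, 1],
        "avg_price_per_room": [65.0, 0.0, 99.5, 99.5],
    }
)
report = sanity_report(toy)
print(report)
```

On the real data this would be called as `sanity_report(df)`. Whether flagged rows should be kept, corrected, or dropped is a separate decision that depends on how the hotel records such bookings.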

Exploratory Data Analysis (EDA)¶

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Leading Questions:

  1. What are the busiest months in the hotel?
  2. Which market segment do most of the guests come from?
  3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?
  4. What percentage of bookings are canceled?
  5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?
  6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

Univariate Analysis¶

In [70]:
# Ensure plots display clearly
plt.rcParams['figure.figsize'] = (8, 5)

# Convert target to binary for modeling later
df['booking_status_binary'] = df['booking_status'].map({'Canceled': 1, 'Not_Canceled': 0})

# --- Target variable distribution ---
sns.countplot(x='booking_status', data=df, palette='viridis')
plt.title("Booking Status Distribution")
plt.show()

# --- Univariate for numeric variables ---
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns.drop(['booking_status_binary'])

for col in numeric_cols:
    plt.figure()
    sns.histplot(df[col], kde=True, bins=30)  # palette has no effect without a hue variable
    plt.title(f"Distribution of {col}")
    plt.show()

# --- Univariate for categorical variables ---
categorical_cols = df.select_dtypes(include=['object']).columns.drop(['Booking_ID', 'booking_status'])

for col in categorical_cols:
    plt.figure()
    sns.countplot(x=col, data=df, order=df[col].value_counts().index, palette='viridis')
    plt.title(f"{col} distribution")
    plt.xticks(rotation=45)
    plt.show()
[18 figures: booking status count plot, a histogram for each of the 14 numeric columns, and a count plot for each of the 3 categorical columns]

We observe that:

  • Target variable: Imbalanced, with more Not_Canceled than Canceled bookings.

  • Numeric features: Some are skewed (lead_time, avg_price_per_room).

  • Categorical features: Certain categories dominate (e.g., Meal Plan 1, Room_Type 1, Online segment).
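The skewness noted for the numeric features can also be quantified rather than read off the histograms. The following is a minimal sketch using pandas' built-in sample skewness; the helper name `rank_by_skewness` and the toy values are hypothetical and only illustrate the idea.

```python
import pandas as pd


def rank_by_skewness(frame: pd.DataFrame, columns: list) -> pd.Series:
    """Return the sample skewness of each column, most right-skewed first."""
    return frame[columns].skew().sort_values(ascending=False)


# Hypothetical toy values, for illustration only
toy = pd.DataFrame(
    {
        "lead_time": [0, 1, 2, 3, 5, 8, 40, 200],          # long right tail
        "arrival_date": [1, 5, 10, 15, 16, 20, 25, 31],    # roughly symmetric
    }
)
skewness = rank_by_skewness(toy, ["lead_time", "arrival_date"])
print(skewness)
```

A value near 0 suggests a roughly symmetric distribution, while a clearly positive value indicates a right tail. On the real data the call would be something like `rank_by_skewness(df, list(numeric_cols))`.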

In [71]:
# function to create labeled barplots


def labeled_barplot(data_df, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data_df: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data_df[feature])  # length of the column
    count = data_df[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data_df,
        x=feature,
        palette="Paired",
        order=data_df[feature].value_counts().index[:n],
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.tight_layout()

    plt.show()  # show the plot
In [72]:
cols_for_analysis = [
    'required_car_parking_space',
    'type_of_meal_plan',
    'room_type_reserved',
    'arrival_month',
    'market_segment_type',
    'booking_status',
    'no_of_children',
    'no_of_week_nights',
    'no_of_weekend_nights',
    'no_of_adults',
]

for col in cols_for_analysis:
    labeled_barplot(data_df=df, feature=col, perc=True)
[10 figures: labeled bar plots with percentages, one for each column in cols_for_analysis]
In [73]:
def histogram_boxplot(data_df, feature, figsize=(15, 10), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data_df: dataframe
    feature: dataframe column
    figsize: size of figure (default (15,10))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data_df, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # Histogram; pass bins only when a value was supplied
    if bins:
        sns.histplot(data=data_df, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data_df, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data_df[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data_df[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [74]:
# observation on average lead time
histogram_boxplot(data_df=df, feature='lead_time')
[Figure: box plot and histogram of lead_time]
In [75]:
# observation on average price per room
histogram_boxplot(data_df=df, feature='avg_price_per_room')
[Figure: box plot and histogram of avg_price_per_room]
In [76]:
# observation on number of previous booking not canceled
histogram_boxplot(data_df=df, feature='no_of_previous_bookings_not_canceled')
[Figure: box plot and histogram of no_of_previous_bookings_not_canceled]
In [77]:
# Observations on number of previous booking cancellations
histogram_boxplot(data_df=df, feature='no_of_previous_cancellations')
[Figure: box plot and histogram of no_of_previous_cancellations]

Based on the univariate analysis performed:

Categorical Features:

  • required_car_parking_space: A vast majority of bookings (96.9%) do not require a car parking space, while only a small percentage (3.1%) do.
  • type_of_meal_plan: Meal Plan 1 is the most popular choice (76.7%), followed by "Not Selected" (14.1%) and Meal Plan 2 (9.1%). Meal Plan 3 is very rarely selected.
  • room_type_reserved: Room Type 1 is the most frequently reserved room type (77.5%), with Room Type 4 being the second most common (16.7%). Other room types are reserved much less frequently.
  • arrival_month: October, September, and August appear to be the busiest months in terms of the number of bookings. There's a clear seasonal pattern in bookings.
  • market_segment_type: The "Online" market segment accounts for the largest proportion of bookings (64.0%), followed by "Offline" (29.0%). "Corporate", "Complementary", and "Aviation" segments have significantly fewer bookings.
  • booking_status: The dataset shows an imbalance in the target variable, with approximately 67.2% of bookings being 'Not_Canceled' and 32.8% being 'Canceled'.
  • no_of_children: The vast majority of bookings (92.6%) have no children. Bookings with 1 or 2 children are less common, and bookings with more than 2 children are very rare.
  • no_of_week_nights: The most frequent number of week nights stayed is 2, followed by 1 and 3. The distribution is skewed towards fewer week nights.
  • no_of_weekend_nights: The most frequent number of weekend nights stayed is 0, followed by 1 and 2. This suggests that many bookings might not include a weekend stay.
  • no_of_adults: The majority of bookings are for 2 adults (72.0%), followed by 1 adult (21.2%). Bookings with 3 or 4 adults are less common.

Numerical Features (from Histograms and Boxplots):

  • lead_time: The distribution of lead time is skewed to the right, with a large concentration of bookings made with a short lead time. There are also some outliers with very long lead times.
  • avg_price_per_room: The distribution of average price per room appears to be roughly unimodal but slightly skewed to the right. There are some outliers with very high average prices.
  • no_of_previous_bookings_not_canceled: The vast majority of guests have no previous non-canceled bookings, so the distribution is heavily skewed towards 0. There are some outliers with a high number of previous non-canceled bookings.
  • no_of_previous_cancellations: Similar to previous non-canceled bookings, most guests have not had previous cancellations. The distribution is heavily skewed towards 0, with a few outliers representing guests with multiple previous cancellations.
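The outliers mentioned above are the points a box plot draws beyond its whiskers, which by the usual convention sit 1.5 × IQR outside the quartiles. Using the quartiles reported earlier for lead_time (Q1 = 17, Q3 = 126), the IQR is 109 and the upper fence is 126 + 1.5 × 109 = 289.5 days, so the maximum of 443 days lies well beyond it. The sketch below computes the share of such points; the helper name `iqr_outlier_share` and the toy series are hypothetical.

```python
import pandas as pd


def iqr_outlier_share(values: pd.Series, k: float = 1.5) -> float:
    """Fraction of values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return float(outside.mean())


# Hypothetical toy lead times in days, for illustration only
toy_lead_time = pd.Series([0, 5, 10, 17, 57, 60, 90, 126, 130, 443])
share = iqr_outlier_share(toy_lead_time)
print(f"Share of IQR outliers: {share:.1%}")
```

A high share under this rule does not by itself mean the values are errors. Long lead times are legitimate bookings, so this is better read as a measure of tail weight than as a list of rows to delete.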

Bivariate Analysis¶

In [78]:
# Numerical columns for bivariate analysis
numerical_cols_bivariate = [
    'no_of_adults',
    'no_of_children',
    'no_of_weekend_nights',
    'no_of_week_nights',
    'lead_time',
    'no_of_previous_cancellations',
    'no_of_previous_bookings_not_canceled',
    'avg_price_per_room',
    'no_of_special_requests'
]

# Create box plots for each numerical column against booking_status
plt.figure(figsize=(15, 20))
for i, col in enumerate(numerical_cols_bivariate):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(data=df, x='booking_status', y=col, palette='viridis')
    plt.title(f'Booking Status vs. {col}')
    plt.xlabel('Booking Status')
    plt.ylabel(col)
plt.tight_layout()
plt.show()
[Figure: box plots of each numerical feature against booking_status]
In [79]:
categorical_cols_bivariate = [
    'type_of_meal_plan',
    'room_type_reserved',
    'market_segment_type',
    'arrival_year',
    'arrival_month',
    'arrival_date',
    'required_car_parking_space',
    'repeated_guest'
]

for col in categorical_cols_bivariate:
    print(f"Bivariate analysis for {col} and booking_status:")
    crosstab_df = pd.crosstab(df[col], df['booking_status'], normalize='index') * 100
    crosstab_df.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
    plt.title(f'Booking Status Distribution by {col}')
    plt.xlabel(col)
    plt.ylabel('Percentage (%)')
    plt.xticks(rotation=45, ha='right')
    plt.legend(title='Booking Status')
    plt.tight_layout()
    plt.show()
Bivariate analysis for type_of_meal_plan and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by type_of_meal_plan]
Bivariate analysis for room_type_reserved and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by room_type_reserved]
Bivariate analysis for market_segment_type and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by market_segment_type]
Bivariate analysis for arrival_year and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by arrival_year]
Bivariate analysis for arrival_month and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by arrival_month]
Bivariate analysis for arrival_date and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by arrival_date]
Bivariate analysis for required_car_parking_space and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by required_car_parking_space]
Bivariate analysis for repeated_guest and booking_status:
[Figure: stacked bar chart, Booking Status Distribution by repeated_guest]
In [80]:
# Categorical columns for bivariate analysis with booking_status
categorical_cols_bivariate_target = [
    'type_of_meal_plan',
    'room_type_reserved',
    'market_segment_type',
    'arrival_year',
    'arrival_month',
    'arrival_date',
    'required_car_parking_space',
    'repeated_guest'
]

# Create grouped bar plots for each categorical column against booking_status
plt.figure(figsize=(15, 25)) # Adjusted figure size for better layout
for i, col in enumerate(categorical_cols_bivariate_target):
    plt.subplot(4, 2, i + 1) # Arrange plots in a 4x2 grid
    sns.countplot(data=df, x=col, hue='booking_status', palette='viridis')
    plt.title(f'Booking Status vs. {col}')
    plt.xlabel(col)
    plt.ylabel('Number of Bookings')
    if col in ['room_type_reserved', 'market_segment_type', 'arrival_date']: # Rotate labels for columns with many categories
        plt.xticks(rotation=45, ha='right')
    else:
        plt.xticks(rotation=0)
plt.tight_layout()
plt.show()
[Figure: grouped count plots of each categorical feature by booking_status]

Bivariate Analysis Summary¶

Based on the bivariate analysis of numerical and categorical features against the booking_status:

Numerical Features:

  • no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights: The box plots show some differences in the distributions of these features between canceled and not-canceled bookings, but the overlap is significant. This suggests they might have some predictive power, but likely not strong on their own.
  • lead_time: Bookings with longer lead times appear to have a higher tendency to be canceled, as indicated by the higher median and wider spread of lead time for canceled bookings compared to not-canceled bookings.
  • no_of_previous_cancellations: Guests with previous cancellations are more likely to cancel future bookings. The box plot shows that canceled bookings have a higher number of previous cancellations on average, with some outliers having a significant number of past cancellations.
  • no_of_previous_bookings_not_canceled: Guests with a higher number of previous non-canceled bookings are less likely to cancel. The distribution for not-canceled bookings is skewed towards a higher number of previous successful bookings.
  • avg_price_per_room: There appears to be a slight difference in the average price per room between canceled and not-canceled bookings, with canceled bookings having a slightly higher median price. However, there is a large overlap in the distributions.
  • no_of_special_requests: Bookings with fewer special requests seem to have a higher cancellation rate. As the number of special requests increases, the cancellation percentage appears to decrease significantly. This is a notable observation.

Categorical Features:

  • type_of_meal_plan: There are variations in cancellation rates across different meal plans. Meal Plan 2 seems to have a higher cancellation percentage compared to Meal Plan 1 and "Not Selected".
  • room_type_reserved: Some room types might have higher or lower cancellation rates than others. Room Type 4 and Room Type 6 show a relatively higher proportion of cancellations compared to Room Type 1.
  • market_segment_type: The market segment has a clear impact on cancellation rates. The "Online" segment appears to have a higher cancellation rate compared to the "Offline" and "Corporate" segments. "Complementary" and "Aviation" have very few bookings, making it harder to draw strong conclusions, but the cancellation rate for Aviation appears to be high.
  • arrival_year: The cancellation rate appears to be higher in 2018 compared to 2017.
  • arrival_month: There are fluctuations in cancellation rates across months. Some months, like July and October, show a higher percentage of cancellations.
  • arrival_date: The day of the month also seems to influence cancellation rates, with certain dates showing higher cancellation percentages.
  • required_car_parking_space: Bookings requiring a parking space have a significantly lower cancellation rate compared to those that do not require one.
  • repeated_guest: Repeating guests have a much lower cancellation rate than new guests. This is consistent with the 1.72% cancellation rate computed for repeating guests under the leading questions.

Overall, features like lead_time, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, no_of_special_requests, market_segment_type, arrival_year, arrival_month, arrival_date, required_car_parking_space, and repeated_guest show potential in predicting booking cancellations.
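Several of the observations above come with the caveat that some categories have very few bookings. One way to keep that visible is to tabulate the cancellation rate next to the group size. The sketch below assumes the same column names as the dataset; the helper name `cancellation_rate_by` and the toy rows are hypothetical.

```python
import pandas as pd


def cancellation_rate_by(frame, column, target="booking_status"):
    """Return booking count and cancellation rate (%) per level of `column`."""
    is_canceled = frame[target].eq("Canceled")
    summary = is_canceled.groupby(frame[column]).agg(
        n_bookings="count", cancel_rate_pct="mean"
    )
    summary["cancel_rate_pct"] = summary["cancel_rate_pct"] * 100
    return summary.sort_values("cancel_rate_pct", ascending=False)


# Hypothetical toy rows, for illustration only
toy = pd.DataFrame(
    {
        "market_segment_type": [
            "Online", "Online", "Online", "Online",
            "Offline", "Offline", "Aviation",
        ],
        "booking_status": [
            "Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
            "Not_Canceled", "Not_Canceled", "Canceled",
        ],
    }
)
rates = cancellation_rate_by(toy, "market_segment_type")
print(rates)
```

In the toy data the single Aviation booking produces a 100% rate, which shows why a rate should be read together with `n_bookings`. On the real data the same call, for example `cancellation_rate_by(df, "no_of_special_requests")`, would give the numbers behind the stacked bar charts.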

Leading Questions:

1. What are the busiest months in the hotel?

In [81]:
# Univariate analysis on arrival_month

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='arrival_month', palette='viridis')
plt.title('Distribution of Bookings Across Months')
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.xticks(ticks=range(0, 12), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.show()
[Figure: count plot, Distribution of Bookings Across Months]

  • October has the highest number of bookings. The three busiest months are October, September, and August.

2. Which market segment do most of the guests come from?

In [82]:
# Univariate analysis of market_segment_type

plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='market_segment_type', palette='viridis')
plt.title('Distribution of Market Segment Type')
plt.xlabel('Market Segment Type')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=45)
plt.show()
[Figure: count plot, Distribution of Market Segment Type]
  • Most guests come from the Online market segment, with 23,214 bookings (about 64%). Offline is next, with a little over 10,000 bookings (about 29%).

3. Hotel rates are dynamic and change according to demand and customer demographics. What are the differences in room prices in different market segments?

In [83]:
# Group by market segment type and calculate the average price per room
average_price_by_segment = df.groupby('market_segment_type')['avg_price_per_room'].mean().sort_values(ascending=False)

# Plot the average price per room for each market segment
plt.figure(figsize=(10, 6))
sns.barplot(x=average_price_by_segment.index, y=average_price_by_segment.values, palette='viridis')
plt.title('Average Price per Room by Market Segment')
plt.xlabel('Market Segment Type')
plt.ylabel('Average Price per Room')
plt.xticks(rotation=45)
plt.show()
[Figure: bar plot, Average Price per Room by Market Segment]
  • The average price varies across market segments: Online has the highest average price, followed by Aviation and Offline, while Complementary has the lowest.
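Because avg_price_per_room is right-skewed and has a minimum of 0, the mean alone can be misleading when comparing segments. The sketch below reports the median and the share of zero-price bookings alongside the mean. The helper name `price_summary_by_segment` and the toy prices are hypothetical, and the idea that zero prices concentrate in particular segments is an assumption to check against the real data, not a finding of this notebook.

```python
import pandas as pd


def price_summary_by_segment(frame: pd.DataFrame) -> pd.DataFrame:
    """Count, mean, median and zero-price share of room price per segment."""
    segment = frame["market_segment_type"]
    price = frame["avg_price_per_room"]
    summary = price.groupby(segment).agg(["count", "mean", "median"])
    # Share of bookings in each segment recorded with a price of exactly 0
    summary["zero_price_share"] = price.eq(0).groupby(segment).mean()
    return summary.sort_values("mean", ascending=False)


# Hypothetical toy prices in euros, for illustration only
toy = pd.DataFrame(
    {
        "market_segment_type": ["Online"] * 3 + ["Complementary"] * 3,
        "avg_price_per_room": [100.0, 120.0, 110.0, 0.0, 0.0, 6.0],
    }
)
price_summary = price_summary_by_segment(toy)
print(price_summary)
```

On the real data this would be called as `price_summary_by_segment(df)`, and a large gap between mean and median within a segment would point to a few very expensive or very cheap bookings driving the average.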

4. What percentage of bookings are canceled?

In [84]:
# Calculate the percentage of canceled and not-canceled bookings
cancellation_counts = df['booking_status'].value_counts(normalize=True) * 100

# Create a bar plot
plt.figure(figsize=(6, 4))
sns.barplot(x=cancellation_counts.index, y=cancellation_counts.values, palette='viridis')
plt.title('Overall Booking Cancellation Percentage')
plt.xlabel('Booking Status')
plt.ylabel('Percentage (%)')
plt.ylim(0, 100)
plt.show()
[Figure: bar plot, Overall Booking Cancellation Percentage]
In [85]:
# Calculate the percentage of canceled bookings
cancellation_percentage = (df['booking_status'].value_counts(normalize=True) * 100).loc['Canceled']

print(f"Percentage of canceled bookings: {cancellation_percentage:.2f}%")
Percentage of canceled bookings: 32.76%
  • 32.76% of all bookings are canceled.

5. Repeating guests are the guests who stay in the hotel often and are important to brand equity. What percentage of repeating guests cancel?

In [86]:
# Filter data for repeating guests
repeating_guests_df = df[df['repeated_guest'] == 1]

# Calculate the percentage of canceled and not-canceled bookings among repeating guests
repeating_guest_cancellation_counts = repeating_guests_df['booking_status'].value_counts(normalize=True) * 100

# Create a bar plot
plt.figure(figsize=(6, 4))
sns.barplot(x=repeating_guest_cancellation_counts.index, y=repeating_guest_cancellation_counts.values, palette='viridis')
plt.title('Booking Cancellation Percentage Among Repeating Guests')
plt.xlabel('Booking Status')
plt.ylabel('Percentage (%)')
plt.ylim(0, 100)
plt.show()
In [87]:
# Filter data for repeating guests
repeating_guests_df = df[df['repeated_guest'] == 1]

# Calculate the percentage of canceled bookings among repeating guests
repeating_guest_cancellation_percentage = (repeating_guests_df['booking_status'].value_counts(normalize=True) * 100).get('Canceled', 0)

print(f"Percentage of canceled bookings among repeating guests: {repeating_guest_cancellation_percentage:.2f}%")
Percentage of canceled bookings among repeating guests: 1.72%
  • Only 1.72% of bookings made by repeating guests are canceled, far below the overall cancellation rate of 32.76%.
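Group-wise cancellation rates like the one above can also be read off in a single step from a 0/1 target column, because the mean of such a column is the share of 1s. The sketch below uses a small made-up frame (the values are hypothetical, not taken from the INN Hotels data) and assumes a binary column named booking_status_binary as in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the INN Hotels dataset)
toy = pd.DataFrame(
    {
        "repeated_guest": [0, 0, 0, 0, 1, 1, 1, 1],
        "booking_status_binary": [1, 0, 1, 0, 0, 0, 0, 1],
    }
)

# The mean of a 0/1 column is the share of 1s, so this gives the
# cancellation rate (in %) for each value of repeated_guest
cancel_rate_by_group = (
    toy.groupby("repeated_guest")["booking_status_binary"].mean() * 100
)
print(cancel_rate_by_group)
```

On the real data, the same groupby would show new and repeating guests side by side, which makes the contrast easier to see than a single filtered percentage.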

6. Many guests have special requirements when booking a hotel room. Do these requirements affect booking cancellation?

In [88]:
# Group by number of special requests and calculate cancellation percentage
cancellation_by_special_requests = df.groupby('no_of_special_requests')['booking_status'].value_counts(normalize=True).unstack().fillna(0)

# Get the cancellation percentage for each number of special requests
cancellation_percentage_by_request = cancellation_by_special_requests['Canceled'] * 100

# Plot the cancellation percentage by number of special requests
plt.figure(figsize=(10, 6))
sns.barplot(x=cancellation_percentage_by_request.index, y=cancellation_percentage_by_request.values, palette='viridis')
plt.title('Cancellation Percentage by Number of Special Requests')
plt.xlabel('Number of Special Requests')
plt.ylabel('Cancellation Percentage (%)')
plt.show()
  • Special requests do appear to be related to booking cancellation: bookings with no special requests have the highest cancellation percentage, so guests who make special requests are less likely to cancel.
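The same per-group percentages can be produced with pd.crosstab, which may be easier to read than the groupby/unstack chain. This is only a sketch on made-up rows (hypothetical values); the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the INN Hotels dataset)
toy = pd.DataFrame(
    {
        "no_of_special_requests": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
        "booking_status": [
            "Canceled", "Canceled", "Not_Canceled", "Not_Canceled",
            "Canceled", "Not_Canceled", "Not_Canceled",
            "Not_Canceled", "Not_Canceled", "Not_Canceled",
        ],
    }
)

# normalize="index" makes every row sum to 1, so each row holds the
# shares of Canceled / Not_Canceled within one request count
share_by_requests = pd.crosstab(
    toy["no_of_special_requests"], toy["booking_status"], normalize="index"
) * 100
print(share_by_requests.round(1))
```

Reading down the Canceled column shows directly whether the cancellation share falls as the number of requests rises.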

Data Preprocessing¶

  • Missing value treatment (if needed)
  • Feature engineering (if needed)
  • Outlier detection and treatment (if needed)
  • Preparing data for modeling
  • Any other preprocessing steps (if needed)

Let's drop the Booking_ID column first before we proceed forward.

In [89]:
df.drop('Booking_ID', axis=1, inplace=True)
In [90]:
df.head()
Out[90]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status booking_status_binary
0 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled 0
1 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled 0
2 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled 1
3 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled 1
4 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled 1

Outlier Check¶

  • Let's check for outliers in the data.
In [91]:
# outlier detection using boxplots
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# exclude the binary target column from the outlier check
numeric_columns.remove("booking_status_binary")

plt.figure(figsize=(15, 12))

for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.title(variable)

# adjust the spacing once, after all subplots have been drawn
plt.tight_layout()
plt.show()

Observations

  • There are quite a few outliers in the data.
  • However, we will not treat them, as they are genuine values rather than data errors.
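To put a number on "quite a few", one option is to count the points that fall outside the boxplot whiskers, using the same 1.5 × IQR rule as whis=1.5. The helper name iqr_outlier_share and the toy frame below are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd


def iqr_outlier_share(frame: pd.DataFrame, whis: float = 1.5) -> pd.Series:
    """Percentage of values per numeric column outside the boxplot whiskers."""
    numeric = frame.select_dtypes(include=np.number)
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    # comparing a DataFrame with a Series aligns on column names
    outside = (numeric < lower) | (numeric > upper)
    return (outside.mean() * 100).sort_values(ascending=False)


# Toy data (hypothetical values): one extreme lead_time, constant no_of_adults
toy = pd.DataFrame(
    {
        "lead_time": [1, 5, 10, 12, 15, 18, 20, 25, 30, 400],
        "no_of_adults": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    }
)
outlier_share = iqr_outlier_share(toy)
print(outlier_share)
```

Applied to df, this would give one percentage per numeric column, which helps decide whether leaving the outliers untreated is reasonable.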

EDA¶

  • It is a good idea to explore the data once again after manipulating it.
In [92]:
# explore EDA once again
df.head()
Out[92]:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights type_of_meal_plan required_car_parking_space room_type_reserved lead_time arrival_year arrival_month arrival_date market_segment_type repeated_guest no_of_previous_cancellations no_of_previous_bookings_not_canceled avg_price_per_room no_of_special_requests booking_status booking_status_binary
0 2 0 1 2 Meal Plan 1 0 Room_Type 1 224 2017 10 2 Offline 0 0 0 65.00000 0 Not_Canceled 0
1 2 0 2 3 Not Selected 0 Room_Type 1 5 2018 11 6 Online 0 0 0 106.68000 1 Not_Canceled 0
2 1 0 2 1 Meal Plan 1 0 Room_Type 1 1 2018 2 28 Online 0 0 0 60.00000 0 Canceled 1
3 2 0 0 2 Meal Plan 1 0 Room_Type 1 211 2018 5 20 Online 0 0 0 100.00000 0 Canceled 1
4 2 0 1 1 Not Selected 0 Room_Type 1 48 2018 4 11 Online 0 0 0 94.50000 0 Canceled 1
In [93]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype  
---  ------                                --------------  -----  
 0   no_of_adults                          36275 non-null  int64  
 1   no_of_children                        36275 non-null  int64  
 2   no_of_weekend_nights                  36275 non-null  int64  
 3   no_of_week_nights                     36275 non-null  int64  
 4   type_of_meal_plan                     36275 non-null  object 
 5   required_car_parking_space            36275 non-null  int64  
 6   room_type_reserved                    36275 non-null  object 
 7   lead_time                             36275 non-null  int64  
 8   arrival_year                          36275 non-null  int64  
 9   arrival_month                         36275 non-null  int64  
 10  arrival_date                          36275 non-null  int64  
 11  market_segment_type                   36275 non-null  object 
 12  repeated_guest                        36275 non-null  int64  
 13  no_of_previous_cancellations          36275 non-null  int64  
 14  no_of_previous_bookings_not_canceled  36275 non-null  int64  
 15  avg_price_per_room                    36275 non-null  float64
 16  no_of_special_requests                36275 non-null  int64  
 17  booking_status                        36275 non-null  object 
 18  booking_status_binary                 36275 non-null  int64  
dtypes: float64(1), int64(14), object(4)
memory usage: 5.3+ MB

Building a Logistic Regression model¶

In [94]:
# defining a function to compute different metrics to check performance of a classification model built using statsmodels
def model_performance_classification_statsmodels(
    model, predictors, target, threshold=0.5
):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """

    # classify as 1 when the predicted probability exceeds the threshold
    pred = (model.predict(predictors) > threshold).astype(int)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
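For reference, the four metrics returned by this function can all be derived from the counts of a 2x2 confusion matrix. The sketch below uses a hypothetical helper, metrics_from_counts, and made-up counts; it is not part of the notebook's pipeline.

```python
def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the four metrics from confusion-matrix counts (class 1 = Canceled)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    recall = tp / (tp + fn)      # share of actual cancellations that were caught
    precision = tp / (tp + fp)   # share of predicted cancellations that were right
    f1 = 2 * precision * recall / (precision + recall)
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# Hypothetical counts, chosen only to make the arithmetic easy to follow
example = metrics_from_counts(tp=60, fp=20, fn=40, tn=80)
print(example)
```

Recall matters when missed cancellations are costly, while precision matters when acting on a false alarm is costly; F1 balances the two.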
In [95]:
# defining a function to plot the confusion_matrix of a classification model


def confusion_matrix_statsmodels(model, predictors, target, threshold=0.5):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    threshold: threshold for classifying the observation as class 1
    """
    y_pred = model.predict(predictors) > threshold
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Logistic Regression (with statsmodels library)¶

  • We want to predict which bookings will be canceled.
  • Before we proceed to build a model, we'll have to encode categorical features.
  • We'll split the data into train and test to be able to evaluate the model that we build on the train data.
In [96]:
X = df.drop(["booking_status", "booking_status_binary"], axis=1) # Dropping both original and binary target columns from features
Y = df["booking_status_binary"] # Use the binary target column

# adding constant
X = sm.add_constant(X)

#create dummies for X
X = pd.get_dummies(X, drop_first=True) # Drop the first dummy variable for each category to handle multicollinearity

# Convert boolean columns to integers
for col in X.columns:
    if X[col].dtype == 'bool':
        X[col] = X[col].astype(int)


# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
In [97]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 28)
Shape of test set :  (10883, 28)
Percentage of classes in training set:
booking_status_binary
0   0.67064
1   0.32936
Name: proportion, dtype: float64
Percentage of classes in test set:
booking_status_binary
0   0.67638
1   0.32362
Name: proportion, dtype: float64
  • In the full data, about 67.24% of observations belong to class 0 (Not Canceled) and 32.76% belong to class 1 (Canceled). The split roughly preserves this ratio: 67.06% / 32.94% in the training set and 67.64% / 32.36% in the test set.
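The split above relies on random sampling alone, so the class ratio in each part is only approximately equal to the overall ratio. If an exact match is wanted, train_test_split accepts a stratify argument. The sketch below demonstrates it on synthetic data (X_toy and y_toy are made-up names and values), not on the notebook's X and Y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 33% of them in the positive class
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 330 + [0] * 670)

# stratify=y_toy keeps the class proportions the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)
print("positive share in train:", y_tr.mean())
print("positive share in test: ", y_te.mean())
```

In this notebook the equivalent change would be passing stratify=Y, which would alter the split and therefore all downstream numbers.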

Checking Multicollinearity¶

  • In order to make statistical inferences from a logistic regression model, it is important to ensure that there is no multicollinearity present in the data.

Checking Multicollinearity using VIF¶

In [98]:
# Check for multicollinearity using VIF

# Check for and handle non-finite values in X_train
if not np.isfinite(X_train).all().all():
    print("Warning: Non-finite values found in X_train. Replacing with NaN and potentially dropping rows/columns or imputing.")
    # Option 1: Replace non-finite values with NaN
    X_train = X_train.replace([np.inf, -np.inf], np.nan)
    # Option 2: Drop rows or columns with NaN values (choose based on data analysis)
    X_train = X_train.dropna() # Example: dropping rows with NaN
    y_train = y_train.loc[X_train.index] # keep the target aligned with the remaining rows
    # Option 3: Impute NaN values (e.g., with mean, median, or mode)
    # from sklearn.impute import SimpleImputer
    # imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    # X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)

# Create a DataFrame to store VIF values
vif_data = pd.DataFrame()
vif_data["feature"] = X_train.columns

# Calculate VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X_train.values, i)
                   for i in range(len(X_train.columns))]

# Print the VIF values
print(vif_data)
                                 feature            VIF
0                                  const 39468156.70600
1                           no_of_adults        1.34815
2                         no_of_children        1.97823
3                   no_of_weekend_nights        1.06948
4                      no_of_week_nights        1.09567
5             required_car_parking_space        1.03993
6                              lead_time        1.39491
7                           arrival_year        1.43083
8                          arrival_month        1.27567
9                           arrival_date        1.00674
10                        repeated_guest        1.78352
11          no_of_previous_cancellations        1.39569
12  no_of_previous_bookings_not_canceled        1.65199
13                    avg_price_per_room        2.05042
14                no_of_special_requests        1.24728
15         type_of_meal_plan_Meal Plan 2        1.27185
16         type_of_meal_plan_Meal Plan 3        1.02522
17        type_of_meal_plan_Not Selected        1.27218
18        room_type_reserved_Room_Type 2        1.10144
19        room_type_reserved_Room_Type 3        1.00330
20        room_type_reserved_Room_Type 4        1.36152
21        room_type_reserved_Room_Type 5        1.02781
22        room_type_reserved_Room_Type 6        1.97307
23        room_type_reserved_Room_Type 7        1.11512
24     market_segment_type_Complementary        4.50011
25         market_segment_type_Corporate       16.92844
26           market_segment_type_Offline       64.11392
27            market_segment_type_Online       71.17643

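Most features have VIF values below 5, indicating low multicollinearity.

The exceptions are the market_segment_type dummy variables: Corporate (16.93), Offline (64.11), and Online (71.18). Dummy variables created from the same categorical feature are expected to be correlated with each other, so we leave them untreated. The very large VIF of const belongs to the intercept term and is not a concern.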
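To see why dummy variables from one categorical feature can show large VIFs, it helps to compute VIF by hand: VIF_j = 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the others. The sketch below is an illustration on synthetic data with a hypothetical helper, vif_manual. Unlike statsmodels' variance_inflation_factor, which uses the columns exactly as given, this helper adds the intercept itself, so it should be fed features without a const column. The toy category has a rare reference level, which is an assumption made only to show the effect.

```python
import numpy as np


def vif_manual(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the others plus an intercept."""
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        vifs[j] = 1 / (1 - r2)
    return vifs


rng = np.random.default_rng(0)
# Hypothetical 3-level category whose reference level "A" is rare
levels = rng.choice(["A", "B", "C"], size=2000, p=[0.02, 0.49, 0.49])
dummy_b = (levels == "B").astype(float)
dummy_c = (levels == "C").astype(float)
noise = rng.normal(size=2000)  # an unrelated numeric feature

vifs = vif_manual(np.column_stack([dummy_b, dummy_c, noise]))
print(dict(zip(["dummy_b", "dummy_c", "noise"], vifs.round(2))))
```

When the dropped level is rare, the remaining dummies almost sum to 1, so each one is nearly determined by the others and its VIF is high even though nothing is wrong with the data.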

Building Logistic Regression Model¶

In [99]:
# fitting logistic regression model
logit = sm.Logit(y_train, X_train)

#fit logistic regression
lg = logit.fit()
#print summary of the model
print(lg.summary())
Warning: Maximum number of iterations has been exceeded.
         Current function value: 0.425036
         Iterations: 35
                             Logit Regression Results                            
=================================================================================
Dep. Variable:     booking_status_binary   No. Observations:                25392
Model:                             Logit   Df Residuals:                    25364
Method:                              MLE   Df Model:                           27
Date:                   Sun, 10 Aug 2025   Pseudo R-squ.:                  0.3293
Time:                           09:13:12   Log-Likelihood:                -10793.
converged:                         False   LL-Null:                       -16091.
Covariance Type:               nonrobust   LLR p-value:                     0.000
========================================================================================================
                                           coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
const                                 -924.5923    120.817     -7.653      0.000   -1161.390    -687.795
no_of_adults                             0.1135      0.038      3.017      0.003       0.040       0.187
no_of_children                           0.1563      0.057      2.732      0.006       0.044       0.268
no_of_weekend_nights                     0.1068      0.020      5.398      0.000       0.068       0.146
no_of_week_nights                        0.0398      0.012      3.239      0.001       0.016       0.064
required_car_parking_space              -1.5939      0.138    -11.561      0.000      -1.864      -1.324
lead_time                                0.0157      0.000     58.868      0.000       0.015       0.016
arrival_year                             0.4570      0.060      7.633      0.000       0.340       0.574
arrival_month                           -0.0415      0.006     -6.418      0.000      -0.054      -0.029
arrival_date                             0.0005      0.002      0.252      0.801      -0.003       0.004
repeated_guest                          -2.3469      0.617     -3.805      0.000      -3.556      -1.138
no_of_previous_cancellations             0.2664      0.086      3.108      0.002       0.098       0.434
no_of_previous_bookings_not_canceled    -0.1727      0.153     -1.131      0.258      -0.472       0.127
avg_price_per_room                       0.0188      0.001     25.404      0.000       0.017       0.020
no_of_special_requests                  -1.4690      0.030    -48.790      0.000      -1.528      -1.410
type_of_meal_plan_Meal Plan 2            0.1768      0.067      2.654      0.008       0.046       0.307
type_of_meal_plan_Meal Plan 3           17.8379   5057.771      0.004      0.997   -9895.212    9930.888
type_of_meal_plan_Not Selected           0.2782      0.053      5.245      0.000       0.174       0.382
room_type_reserved_Room_Type 2          -0.3610      0.131     -2.761      0.006      -0.617      -0.105
room_type_reserved_Room_Type 3          -0.0009      1.310     -0.001      0.999      -2.569       2.567
room_type_reserved_Room_Type 4          -0.2821      0.053     -5.305      0.000      -0.386      -0.178
room_type_reserved_Room_Type 5          -0.7176      0.209     -3.432      0.001      -1.127      -0.308
room_type_reserved_Room_Type 6          -0.9456      0.147     -6.434      0.000      -1.234      -0.658
room_type_reserved_Room_Type 7          -1.3964      0.293     -4.767      0.000      -1.971      -0.822
market_segment_type_Complementary      -41.8798   8.42e+05  -4.98e-05      1.000   -1.65e+06    1.65e+06
market_segment_type_Corporate           -1.1935      0.266     -4.487      0.000      -1.715      -0.672
market_segment_type_Offline             -2.1955      0.255     -8.625      0.000      -2.694      -1.697
market_segment_type_Online              -0.3990      0.251     -1.588      0.112      -0.891       0.093
========================================================================================================

Dropping high p-value variables¶

  • We will drop the predictor variables having a p-value greater than 0.05 as they do not significantly impact the target variable.
  • But sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
  • Instead, we will do the following:
    • Build a model, check the p-values of the variables, and drop the column with the highest p-value.
    • Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
    • Repeat the above two steps till there are no columns with p-value > 0.05.

The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.

In [100]:
# initial list of columns
cols = X_train.columns.tolist()

# setting an initial max p-value
max_p_value = 1

while len(cols) > 0:
    # defining the train set
    x_train_aux = X_train[cols]

    # fitting the model
    model = sm.Logit(y_train, x_train_aux).fit(disp=False)

    # getting the p-values and the maximum p-value
    p_values = model.pvalues
    max_p_value = max(p_values)

    # name of the variable with maximum p-value
    feature_with_p_max = p_values.idxmax()

    if max_p_value > 0.05:
        cols.remove(feature_with_p_max)
    else:
        break

selected_features = cols
print(selected_features)
['const', 'no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month', 'repeated_guest', 'no_of_previous_cancellations', 'avg_price_per_room', 'no_of_special_requests', 'type_of_meal_plan_Meal Plan 2', 'type_of_meal_plan_Not Selected', 'room_type_reserved_Room_Type 2', 'room_type_reserved_Room_Type 4', 'room_type_reserved_Room_Type 5', 'room_type_reserved_Room_Type 6', 'room_type_reserved_Room_Type 7', 'market_segment_type_Corporate', 'market_segment_type_Offline']
In [101]:
X_train1 = X_train[selected_features]
X_test1 = X_test[selected_features]
In [102]:
# train logistic regression on X_train1 and y_train
logit1 = sm.Logit(y_train, X_train1)

# fit logistic regression
lg1 = logit1.fit()
# print summary of the model
print(lg1.summary())
Optimization terminated successfully.
         Current function value: 0.425677
         Iterations 11
                             Logit Regression Results                            
=================================================================================
Dep. Variable:     booking_status_binary   No. Observations:                25392
Model:                             Logit   Df Residuals:                    25370
Method:                              MLE   Df Model:                           21
Date:                   Sun, 10 Aug 2025   Pseudo R-squ.:                  0.3283
Time:                           09:13:25   Log-Likelihood:                -10809.
converged:                          True   LL-Null:                       -16091.
Covariance Type:               nonrobust   LLR p-value:                     0.000
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
const                           -917.2860    120.456     -7.615      0.000   -1153.376    -681.196
no_of_adults                       0.1086      0.037      2.914      0.004       0.036       0.182
no_of_children                     0.1522      0.057      2.660      0.008       0.040       0.264
no_of_weekend_nights               0.1086      0.020      5.501      0.000       0.070       0.147
no_of_week_nights                  0.0418      0.012      3.403      0.001       0.018       0.066
required_car_parking_space        -1.5943      0.138    -11.561      0.000      -1.865      -1.324
lead_time                          0.0157      0.000     59.218      0.000       0.015       0.016
arrival_year                       0.4531      0.060      7.591      0.000       0.336       0.570
arrival_month                     -0.0424      0.006     -6.568      0.000      -0.055      -0.030
repeated_guest                    -2.7365      0.557     -4.915      0.000      -3.828      -1.645
no_of_previous_cancellations       0.2289      0.077      2.983      0.003       0.078       0.379
avg_price_per_room                 0.0192      0.001     26.343      0.000       0.018       0.021
no_of_special_requests            -1.4699      0.030    -48.892      0.000      -1.529      -1.411
type_of_meal_plan_Meal Plan 2      0.1654      0.067      2.487      0.013       0.035       0.296
type_of_meal_plan_Not Selected     0.2858      0.053      5.405      0.000       0.182       0.389
room_type_reserved_Room_Type 2    -0.3560      0.131     -2.725      0.006      -0.612      -0.100
room_type_reserved_Room_Type 4    -0.2826      0.053     -5.330      0.000      -0.387      -0.179
room_type_reserved_Room_Type 5    -0.7352      0.208     -3.529      0.000      -1.143      -0.327
room_type_reserved_Room_Type 6    -0.9650      0.147     -6.572      0.000      -1.253      -0.677
room_type_reserved_Room_Type 7    -1.4312      0.293     -4.892      0.000      -2.005      -0.858
market_segment_type_Corporate     -0.7928      0.103     -7.711      0.000      -0.994      -0.591
market_segment_type_Offline       -1.7867      0.052    -34.391      0.000      -1.889      -1.685
==================================================================================================
In [103]:
# check performance on X_train1 and y_train
print("Training performance:")
model_performance_classification_statsmodels(lg1, X_train1, y_train)
Training performance:
Out[103]:
Accuracy Recall Precision F1
0 0.80541 0.63255 0.73903 0.68166

Converting coefficients to odds¶

  • The coefficients of the logistic regression model are in terms of log(odds); to get the effect on the odds, we take the exponential of the coefficients.
  • Therefore, odds = exp(b), i.e. the factor by which the odds of cancellation are multiplied for a one-unit increase in the predictor.
  • The percentage change in odds is given as (exp(b) - 1) * 100.
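As a worked example of these formulas, the sketch below plugs in two coefficients copied from the lg1 summary above (lead_time and repeated_guest). The helper pct_change_in_odds and its units argument are illustrative assumptions; the units argument shows that a change of several units multiplies the coefficient before exponentiating, rather than multiplying the percentage.

```python
import math

coef_lead_time = 0.0157        # copied from the lg1 summary above
coef_repeated_guest = -2.7365  # copied from the lg1 summary above


def pct_change_in_odds(coef: float, units: float = 1.0) -> float:
    """Percentage change in odds for a change of `units` in the predictor."""
    return (math.exp(coef * units) - 1) * 100


one_day = pct_change_in_odds(coef_lead_time)                 # one extra day of lead time
thirty_days = pct_change_in_odds(coef_lead_time, units=30)   # thirty extra days
repeat = pct_change_in_odds(coef_repeated_guest)             # repeating vs new guest

print(round(one_day, 2), round(thirty_days, 2), round(repeat, 2))
```

Note that thirty extra days of lead time raise the odds by much more than 30 times the one-day figure, because the effect compounds multiplicatively.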
In [104]:
# converting coefficients to odds
odds = np.exp(lg1.params)

# finding the percentage change
perc_change_odds = (np.exp(lg1.params) - 1) * 100

# removing limit from number of columns to display
pd.set_option("display.max_columns", None)

# adding the odds to a dataframe
pd.DataFrame({"Odds": odds, "Change_odd%": perc_change_odds}, index=X_train1.columns).T
Out[104]:
const no_of_adults no_of_children no_of_weekend_nights no_of_week_nights required_car_parking_space lead_time arrival_year arrival_month repeated_guest no_of_previous_cancellations avg_price_per_room no_of_special_requests type_of_meal_plan_Meal Plan 2 type_of_meal_plan_Not Selected room_type_reserved_Room_Type 2 room_type_reserved_Room_Type 4 room_type_reserved_Room_Type 5 room_type_reserved_Room_Type 6 room_type_reserved_Room_Type 7 market_segment_type_Corporate market_segment_type_Offline
Odds 0.00000 1.11475 1.16436 1.11475 1.04264 0.20305 1.01584 1.57324 0.95853 0.06480 1.25716 1.01935 0.22994 1.17992 1.33089 0.70046 0.75383 0.47940 0.38099 0.23903 0.45258 0.16750
Change_odd% -100.00000 11.47536 16.43601 11.47526 4.26363 -79.69523 1.58352 57.32351 -4.14725 -93.52026 25.71567 1.93479 -77.00595 17.99156 33.08924 -29.95389 -24.61701 -52.05967 -61.90093 -76.09669 -54.74162 -83.24963

Interpretation of Logistic Regression Model Coefficients (Odds Ratios)¶

Based on the fitted logistic regression model, the following interpretations can be made regarding the impact of each significant predictor on the odds of a booking being canceled (compared to not being canceled), holding all other variables constant:

  • no_of_adults: For every one-unit increase in the number of adults, the odds of cancellation increase by approximately 11.48%.
  • no_of_children: For every one-unit increase in the number of children, the odds of cancellation increase by approximately 16.44%.
  • no_of_weekend_nights: For every one-unit increase in the number of weekend nights, the odds of cancellation increase by approximately 11.48%.
  • no_of_week_nights: For every one-unit increase in the number of week nights, the odds of cancellation increase by approximately 4.26%.
  • required_car_parking_space: Bookings that require a car parking space have approximately 79.70% lower odds of being canceled compared to those that do not require one.
  • lead_time: For every one-unit increase in lead time (number of days between booking and arrival), the odds of cancellation increase by approximately 1.58%. This indicates that bookings made further in advance are slightly more likely to be canceled.
  • arrival_year: Bookings in 2018 have approximately 57.32% higher odds of being canceled compared to bookings in 2017 (the reference year).
  • arrival_month: For every one-unit increase in the arrival month (e.g., moving from January to February), the odds of cancellation decrease by approximately 4.15%. This suggests that bookings later in the year might have slightly lower cancellation odds, although the relationship with individual months was explored in EDA.
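  • repeated_guest: Bookings by repeating guests have approximately 93.52% lower odds of being canceled than bookings by new guests. This is consistent with the 1.72% cancellation rate among repeating guests seen in the EDA and highlights the loyalty of returning customers.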
  • no_of_previous_cancellations: For every one-unit increase in the number of previous cancellations, the odds of cancellation increase by approximately 25.72%. Guests with a history of canceling are more likely to cancel again.
  • avg_price_per_room: For every one-unit increase in the average price per room, the odds of cancellation increase by approximately 1.93%. Higher prices are slightly associated with higher cancellation odds.
  • no_of_special_requests: For every one-unit increase in the number of special requests, the odds of cancellation decrease by approximately 77.01%. Bookings with more special requests are significantly less likely to be canceled.
  • type_of_meal_plan_Meal Plan 2: Bookings with Meal Plan 2 have approximately 17.99% higher odds of being canceled compared to the baseline, which is Meal Plan 1 (the reference category) together with Meal Plan 3, whose dummy variable was dropped for its high p-value.
  • type_of_meal_plan_Not Selected: Bookings with Not Selected meal plan have approximately 33.09% higher odds of being canceled compared to bookings with Meal Plan 1.
  • room_type_reserved_Room_Type 2: Bookings with Room Type 2 have approximately 29.95% lower odds of being canceled compared to the baseline, which is Room Type 1 (the reference category) together with Room Type 3, whose dummy variable was dropped for its high p-value.
  • room_type_reserved_Room_Type 4: Bookings with Room Type 4 have approximately 24.62% lower odds of being canceled compared to bookings with Room Type 1.
  • room_type_reserved_Room_Type 5: Bookings with Room Type 5 have approximately 52.06% lower odds of being canceled compared to bookings with Room Type 1.
  • room_type_reserved_Room_Type 6: Bookings with Room Type 6 have approximately 61.90% lower odds of being canceled compared to bookings with Room Type 1.
  • room_type_reserved_Room_Type 7: Bookings with Room Type 7 have approximately 76.10% lower odds of being canceled compared to bookings with Room Type 1.
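  • market_segment_type_Corporate: Bookings from the Corporate market segment have approximately 54.74% lower odds of being canceled compared to the baseline segments. The baseline is not Online alone: Aviation is the reference category removed by drop_first=True, and the Online and Complementary dummy variables were dropped for their high p-values, so all three segments are absorbed into the baseline.
  • market_segment_type_Offline: Bookings from the Offline market segment have approximately 83.25% lower odds of being canceled compared to the same baseline segments (Aviation, Online, and Complementary).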

These interpretations highlight the factors that significantly influence the likelihood of a booking being canceled and can inform strategies for reducing cancellations.
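A change in odds is not the same as an equal change in probability, so these percentages should not be read as "X% fewer cancellations". The sketch below converts an odds ratio into a new probability for a given starting point. The helper apply_odds_ratio and the 0.40 baseline probability are hypothetical; the odds ratio 0.06480 is the repeated_guest value from the table above.

```python
def apply_odds_ratio(p_baseline: float, odds_ratio: float) -> float:
    """Return the probability obtained after multiplying the baseline odds."""
    odds = p_baseline / (1 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline cancellation probability of 0.40,
# combined with the repeated_guest odds ratio from the table above
p_new = apply_odds_ratio(0.40, 0.06480)
print(round(p_new, 4))
```

The size of the probability change depends on the baseline: the same odds ratio moves a probability near 0.5 much more than one that is already close to 0 or 1.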

Model performance evaluation¶

Confusion Matrix¶

In [105]:
# creating confusion matrix
confusion_matrix_statsmodels(lg1, X_train1, y_train)
In [106]:
print("Training performance:")
# check performance on X_train1 and y_train
log_reg_model_train_perf = model_performance_classification_statsmodels(lg1, X_train1, y_train)
log_reg_model_train_perf
Training performance:
Out[106]:
Accuracy Recall Precision F1
0 0.80541 0.63255 0.73903 0.68166

ROC-AUC¶

  • ROC-AUC on training set
In [107]:
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict(X_train1))
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
  • The logistic regression model performs reasonably well on the training set (accuracy 0.805 and F1 0.682 at the default threshold of 0.5).

Model Performance Improvement¶

  • Let's see if the recall score can be improved further, by changing the model threshold using AUC-ROC Curve.

Final Model Summary¶

Optimal threshold using AUC-ROC curve¶

In [108]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict(X_train1))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold_auc_roc = thresholds[optimal_idx]
print(optimal_threshold_auc_roc)
0.3710466623488775
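The rule used here, picking the threshold that maximizes tpr - fpr, is commonly known as Youden's J statistic. The sketch below walks through it on a tiny made-up set of labels and scores (hypothetical values) so the selection can be checked by hand.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (hypothetical values)
y_true = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.80, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Youden's J: vertical distance between the ROC curve and the diagonal
j_scores = tpr - fpr
best_idx = np.argmax(j_scores)
best_threshold = thresholds[best_idx]

print("best threshold:", best_threshold)
print("tpr, fpr at best threshold:", tpr[best_idx], fpr[best_idx])
```

This criterion weights false positives and false negatives equally. If missing a cancellation costs the hotel more than a false alarm, a lower threshold than the one it selects may be preferable.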
In [109]:
# creating confusion matrix
# create the confusion matrix for X_train1 and y_train with optimal_threshold_auc_roc as threshold
confusion_matrix_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
In [110]:
# checking model performance for this model
log_reg_model_train_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_auc_roc
Training performance:
Out[110]:
Accuracy Recall Precision F1
0 0.79289 0.73562 0.66870 0.70056

Let's use Precision-Recall curve and see if we can find a better threshold¶

In [111]:
y_scores = lg1.predict(X_train1)
prec, rec, tre = precision_recall_curve(y_train, y_scores,)


def plot_prec_recall_vs_tresh(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="precision")
    plt.plot(thresholds, recalls[:-1], "g--", label="recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])


plt.figure(figsize=(10, 7))
plot_prec_recall_vs_tresh(prec, rec, tre)
plt.show()
[Figure: precision and recall versus threshold on the training set]
In [112]:
# setting the threshold near the point where the precision and recall curves intersect
optimal_threshold_curve = 0.42
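The value 0.42 appears to have been read off the plot, near the point where the precision and recall curves cross. That crossover can also be located programmatically. The sketch below does so on made-up data; the helper crossover_threshold and the toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def crossover_threshold(y_true, y_score):
    """Return the threshold at which precision and recall are closest."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
    # The last precision/recall pair has no matching threshold, so drop it
    gap = np.abs(precisions[:-1] - recalls[:-1])
    return thresholds[np.argmin(gap)]


# Toy data (assumed)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.65, 0.70, 0.80, 0.90])

best_threshold = crossover_threshold(y_true, y_score)
print(best_threshold)
```

Applied to the notebook's y_train and y_scores, the same helper should land close to the value chosen by eye.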

Checking model performance on training set¶

In [113]:
# checking model performance at the precision-recall threshold (0.42)
log_reg_model_train_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=optimal_threshold_curve
)
print("Training performance:")
log_reg_model_train_perf_threshold_curve
Training performance:
Out[113]:
Accuracy Recall Precision F1
0 0.80128 0.69939 0.69789 0.69864
In [114]:
# re-checking the rounded AUC-ROC threshold under a separate name,
# so that optimal_threshold_curve (0.42) is not overwritten
rounded_threshold_auc_roc = 0.37
In [115]:
log_reg_model_train_perf_threshold_rounded = model_performance_classification_statsmodels(
    lg1, X_train1, y_train, threshold=rounded_threshold_auc_roc
)
print("Training performance:")
log_reg_model_train_perf_threshold_rounded
Training performance:
Out[115]:
Accuracy Recall Precision F1
0 0.79265 0.73622 0.66808 0.70049

Let's check the performance on the test set¶

In [116]:
# Plotting confusion matrix for the testing data
print("Confusion Matrix (Testing Data):")
confusion_matrix_statsmodels(lg1, X_test1, y_test)
plt.show()
Confusion Matrix (Testing Data):
[Figure: confusion matrix on the test set at the default threshold]
In [117]:
# check performance on X_test1 and y_test at the default threshold
log_reg_model_test_perf = model_performance_classification_statsmodels(lg1, X_test1, y_test)
print("Test performance:")
log_reg_model_test_perf
Test performance:
Out[117]:
Accuracy Recall Precision F1
0 0.80465 0.63089 0.72900 0.67641
  • ROC curve on test set
In [118]:
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict(X_test1))
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict(X_test1))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_test)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver operating characteristic")
plt.legend(loc="lower right")
plt.show()
[Figure: ROC curve for the logistic regression model on the test set]

Using model with threshold=0.37

In [119]:
# creating confusion matrix
# create confusion matrix for X_test1 and y_test using optimal_threshold_auc_roc as threshold
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc)
[Figure: confusion matrix on the test set at threshold 0.37]
In [120]:
# checking model performance for this model
log_reg_model_test_perf_threshold_auc_roc = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_auc_roc
)
print("Test performance:")
log_reg_model_test_perf_threshold_auc_roc
Test performance:
Out[120]:
Accuracy Recall Precision F1
0 0.79601 0.73935 0.66667 0.70113

Using model with threshold = 0.42

In [121]:
# creating confusion matrix
# setting the threshold
optimal_threshold_curve = 0.42
# create confusion matrix for X_test1 and y_test using optimal_threshold_curve as threshold
confusion_matrix_statsmodels(lg1, X_test1, y_test, threshold=optimal_threshold_curve)
[Figure: confusion matrix on the test set at threshold 0.42]
In [122]:
log_reg_model_test_perf_threshold_curve = model_performance_classification_statsmodels(
    lg1, X_test1, y_test, threshold=optimal_threshold_curve
)
print("Test performance:")
log_reg_model_test_perf_threshold_curve
Test performance:
Out[122]:
Accuracy Recall Precision F1
0 0.80364 0.70386 0.69381 0.69880

Model performance summary¶

In [123]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        log_reg_model_train_perf.T,
        log_reg_model_train_perf_threshold_auc_roc.T,
        log_reg_model_train_perf_threshold_curve.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[123]:
Logistic Regression-default Threshold Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.80541 0.79289 0.80128
Recall 0.63255 0.73562 0.69939
Precision 0.73903 0.66870 0.69789
F1 0.68166 0.70056 0.69864
In [124]:
# test performance comparison

models_test_comp_df = pd.concat(
    [
        log_reg_model_test_perf.T,
        log_reg_model_test_perf_threshold_auc_roc.T,
        log_reg_model_test_perf_threshold_curve.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Logistic Regression-default Threshold",
    "Logistic Regression-0.37 Threshold",
    "Logistic Regression-0.42 Threshold",
]

print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[124]:
Logistic Regression-default Threshold Logistic Regression-0.37 Threshold Logistic Regression-0.42 Threshold
Accuracy 0.80465 0.79601 0.80364
Recall 0.63089 0.73935 0.70386
Precision 0.72900 0.66667 0.69381
F1 0.67641 0.70113 0.69880

General Model Performance

  • Accuracy is consistent between the training and test sets (~79–81%), indicating that the model is not overfitting and generalizes well.

  • F1-scores are also consistent across the training and test sets, showing a stable precision-recall trade-off.

  • Default threshold (0.5): precision (~0.73) is noticeably higher than recall (~0.63) on the test set, so a sizeable share of cancellations is missed.

  • Threshold 0.37: recall improves markedly (~0.74) on the test set, while precision drops (~0.67). This threshold gives the highest test F1 (~0.70).

  • Threshold 0.42: recall improves moderately (~0.70) and precision (~0.69) is nearly equal to it, which makes this a balanced choice. The F1-score (~0.70) is slightly better than at the default threshold.
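Because every threshold is evaluated in its own cell and stored in its own variable, results are easy to mix up. One alternative is to compute all thresholds in a single loop. The sketch below is an assumption-laden illustration: threshold_report is a hypothetical helper, the data are made up, and a booking is treated as canceled when its predicted probability is strictly greater than the threshold.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def threshold_report(y_true, y_prob, thresholds):
    """Return one row of classification metrics per probability threshold."""
    rows = []
    for t in thresholds:
        pred = (np.asarray(y_prob) > t).astype(int)
        rows.append(
            {
                "Threshold": t,
                "Accuracy": accuracy_score(y_true, pred),
                "Recall": recall_score(y_true, pred),
                "Precision": precision_score(y_true, pred),
                "F1": f1_score(y_true, pred),
            }
        )
    return pd.DataFrame(rows).set_index("Threshold")


# Toy data (assumed)
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.65, 0.70, 0.80, 0.90])

report = threshold_report(y_true, y_prob, [0.30, 0.50])
print(report)
```

In the notebook this could be called with lg1.predict(X_test1) as y_prob and [0.5, 0.37, 0.42] as thresholds, so that each row is labeled by the threshold that produced it.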

Building a Decision Tree model¶

Data Preparation for modeling (Decision Tree)¶

  • We want to predict which bookings will be canceled.
  • Before we proceed to build a model, we'll have to encode categorical features.
  • We'll split the data into train and test to be able to evaluate the model that we build on the train data.
In [125]:
X = df.drop(["booking_status", "booking_status_binary"], axis=1) # Drop both original and binary target
Y = df["booking_status_binary"] # Use the binary target column

# create dummies for X
X = pd.get_dummies(X, drop_first=True)

# Splitting data in train and test sets
# split the data into train test in the ratio 70:30 with random_state = 1
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
In [126]:
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (25392, 27)
Shape of test set :  (10883, 27)
Percentage of classes in training set:
booking_status_binary
0   0.67064
1   0.32936
Name: proportion, dtype: float64
Percentage of classes in test set:
booking_status_binary
0   0.67638
1   0.32362
Name: proportion, dtype: float64
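The split above is not stratified, which is why the cancellation rate differs slightly between the training set (32.9%) and the test set (32.4%). If identical class proportions are wanted, train_test_split accepts a stratify argument. A minimal sketch on a made-up frame follows; the column names and values are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame (assumed): one numeric and one categorical predictor
toy = pd.DataFrame(
    {"lead_time": list(range(100)), "segment": ["Online", "Offline"] * 50}
)
target = pd.Series([1] * 30 + [0] * 70)  # 30% canceled

# One-hot encode the categorical column, dropping the first level
X_toy = pd.get_dummies(toy, drop_first=True)

# stratify keeps the class ratio the same in both parts of the split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, target, test_size=0.3, random_state=1, stratify=target
)
print(y_tr.mean(), y_te.mean())
```

With drop_first=True the "Offline" level becomes the baseline, so only a segment_Online indicator is kept.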

First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.¶

  • The model_performance_classification_sklearn function will be used to check the performance of each model.
  • The confusion_matrix_sklearn function will be used to plot the confusion matrix.
In [127]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [128]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
In [129]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=1)

# Fit the model to the training data
dt_model.fit(X_train, y_train)
Out[129]:
DecisionTreeClassifier(random_state=1)

Checking model performance on training set¶

Confusion Matrix¶

In [130]:
# Plotting confusion matrix for the Decision Tree model on training data
print("Decision Tree Model Confusion Matrix (Training Data):")
confusion_matrix_sklearn(dt_model, X_train, y_train)
plt.show()
Decision Tree Model Confusion Matrix (Training Data):
[Figure: confusion matrix for the unpruned decision tree on the training set]
In [131]:
decision_tree_perf_train = model_performance_classification_sklearn(
    dt_model, X_train, y_train
)
decision_tree_perf_train
Out[131]:
Accuracy Recall Precision F1
0 0.99421 0.98661 0.99578 0.99117
In [132]:
# Plotting confusion matrix for the Decision Tree model on testing data
print("Decision Tree Model Confusion Matrix (Testing Data):")
confusion_matrix_sklearn(dt_model, X_test, y_test)
plt.show()
Decision Tree Model Confusion Matrix (Testing Data):
[Figure: confusion matrix for the unpruned decision tree on the test set]
In [133]:
# Evaluating Decision Tree model performance on the testing data
decision_tree_perf_test = model_performance_classification_sklearn(
    dt_model, X_test, y_test
)
print("Decision Tree Model Testing Performance:")
decision_tree_perf_test
Decision Tree Model Testing Performance:
Out[133]:
Accuracy Recall Precision F1
0 0.87108 0.81034 0.79521 0.80270

Observations on Decision Tree (Unpruned):

  • The unpruned decision tree is nearly perfect on the training data (accuracy ≈ 0.99, F1 ≈ 0.99) but noticeably weaker on the test data (accuracy ≈ 0.87, F1 ≈ 0.80). This gap is a clear sign of overfitting: the tree has memorized the training data, including its noise, and generalizes less well to unseen data.

Before pruning the tree let's check the important features.

In [134]:
feature_names = list(X_train.columns)
importances = dt_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: feature importances of the unpruned decision tree]
  • In the unpruned decision tree, lead_time and avg_price_per_room are the most important features.

Do we need to prune the tree?¶

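  • Yes. The unpruned tree scores F1 ≈ 0.99 on the training set but only ≈ 0.80 on the test set, so we constrain its growth to reduce overfitting.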
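To make the case for pruning concrete, it can help to compare the size of an unrestricted tree with a depth-limited one, along with the train-test gap of each. The sketch below uses synthetic data from make_classification rather than the hotel bookings, so the numbers it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (assumed stand-in for the booking data)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.15, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

for name, model in [("unpruned", full_tree), ("max_depth=3", shallow_tree)]:
    gap = model.score(X_tr, y_tr) - model.score(X_te, y_te)
    print(
        f"{name}: depth={model.get_depth()}, leaves={model.get_n_leaves()}, "
        f"train-test accuracy gap={gap:.3f}"
    )
```

An unrestricted tree keeps splitting until its leaves are pure, which is what produces the near-perfect training scores seen earlier in the notebook.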

Pre-Pruning

In [135]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight="balanced")

# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(2, 7, 2),
    "max_leaf_nodes": [50, 75, 150, 250],
    "min_samples_split": [10, 30, 50, 70],
}

# Type of scoring used to compare parameter combinations (F1 score)
f1_scorer = make_scorer(f1_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
Out[135]:
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(6),
                       max_leaf_nodes=50, min_samples_split=10, random_state=1)
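Besides best_estimator_, a fitted GridSearchCV exposes best_params_ and cv_results_, which show how close the competing parameter combinations were. With the default refit=True, best_estimator_ is already refitted on the full training data. The sketch below uses synthetic data and a reduced grid, both of which are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data (assumed stand-in for the booking data)
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.67, 0.33], random_state=1
)

param_grid = {"max_depth": [2, 4, 6], "min_samples_split": [10, 30]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1, class_weight="balanced"),
    param_grid,
    scoring=make_scorer(f1_score),
    cv=5,
)
search.fit(X, y)

print(search.best_params_)

# One row per parameter combination, best combination first
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "param_min_samples_split", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

If several combinations score almost identically, the simpler tree is usually the safer pick.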

Checking performance on training set¶

In [136]:
# create confusion matrix for train data
confusion_matrix_sklearn(estimator, X_train, y_train)
[Figure: confusion matrix for the pre-pruned decision tree on the training set]
In [137]:
# check performance on train set
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train, y_train)
print("Training performance:")
decision_tree_tune_perf_train
Training performance:
Out[137]:
Accuracy Recall Precision F1
0 0.83101 0.78620 0.72428 0.75397

Checking performance on test set¶

In [138]:
# create confusion matrix for test data
confusion_matrix_sklearn(estimator, X_test, y_test)
[Figure: confusion matrix for the pre-pruned decision tree on the test set]
In [139]:
## check performance on test set
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test, y_test)
print("Test performance:")
decision_tree_tune_perf_test
Test performance:
Out[139]:
Accuracy Recall Precision F1
0 0.83497 0.78336 0.72758 0.75444

Visualizing the Decision Tree¶

In [140]:
plt.figure(figsize=(20, 10))
out = tree.plot_tree(
    estimator,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
[Figure: plot of the pre-pruned decision tree]
In [141]:
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- weights: [1736.39, 132.08] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- weights: [960.27, 223.16] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- weights: [129.73, 160.92] class: 1
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- weights: [214.72, 227.72] class: 1
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- weights: [82.76, 285.41] class: 1
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- weights: [87.23, 81.98] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- weights: [228.14, 48.58] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- weights: [363.83, 132.08] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- weights: [219.94, 85.01] class: 0
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- weights: [132.71, 280.85] class: 1
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- weights: [158.80, 159.40] class: 1
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- weights: [850.67, 3543.28] class: 1
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- weights: [15.66, 9.11] class: 0
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- weights: [32.06, 19.74] class: 0
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- weights: [498.03, 44.03] class: 0
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- weights: [258.71, 63.76] class: 0
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- weights: [2512.51, 1451.32] class: 0
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [180.42, 57.69] class: 0
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- weights: [184.90, 56.17] class: 0
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- weights: [106.61, 106.27] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- weights: [3.73, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- weights: [257.96, 62.24] class: 0
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- weights: [213.97, 385.60] class: 1
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- weights: [23.86, 1030.80] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- weights: [7.46, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- weights: [37.28, 4.55] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- weights: [20.13, 212.54] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- weights: [231.12, 110.82] class: 0
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- weights: [19.38, 34.92] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1
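The leaf weights in the text report are not raw booking counts. With class_weight="balanced", each sample is weighted by n_samples / (n_classes * class_count), so the minority class counts for more. The sketch below reproduces those weights. The class counts 17029 and 8363 are an assumption, reconstructed from the printed training proportions (0.67064 and 0.32936 of 25392 rows).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts, reconstructed from the printed training proportions
n_not_canceled, n_canceled = 17029, 8363
y = np.array([0] * n_not_canceled + [1] * n_canceled)

# "balanced" weight for class c = n_samples / (n_classes * count of class c)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights.round(2))
```

Under these weights, a leaf reported as weights: [0.75, 25.81] would correspond to roughly 1 non-canceled and 17 canceled training bookings.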

In [142]:
# importance of features in the tree building

importances = estimator.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: feature importances of the pre-pruned decision tree]
  • In the pre-pruned decision tree, lead_time and market_segment_type_Online are the most important features.
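A bar chart shows the ranking, but the exact values are easier to quote from a sorted pandas Series. The feature_importances_ attribute holds impurity-based (Gini) importances, normalized to sum to 1. The sketch below uses synthetic data and placeholder feature names, both assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (assumed)
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=1
)
names = [f"feature_{i}" for i in range(X.shape[1])]

model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Rank features from most to least important
ranked = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(ranked)
```

In the notebook the same two lines would work with estimator.feature_importances_ and feature_names.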

Post Pruning

Cost Complexity Pruning

In [143]:
clf = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = clf.cost_complexity_pruning_path(X_train, y_train)
# abs() guards against tiny negative alphas caused by floating-point error
ccp_alphas, impurities = abs(path.ccp_alphas), path.impurities
In [144]:
pd.DataFrame(path)
Out[144]:
ccp_alphas impurities
0 0.00000 0.00838
1 -0.00000 0.00838
2 0.00000 0.00838
3 0.00000 0.00838
4 0.00000 0.00838
... ... ...
1837 0.00890 0.32806
1838 0.00980 0.33786
1839 0.01272 0.35058
1840 0.03412 0.41882
1841 0.08118 0.50000

1842 rows × 2 columns

In [145]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
[Figure: total impurity of leaves versus effective alpha for the training set]

Next, we train a decision tree using effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
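The effect of ccp_alpha on tree size can be seen on a small example: alpha 0 keeps the full tree, and larger alphas prune progressively more of it. The sketch below uses synthetic data, an assumption for illustration, and clips the alphas at zero as a guard against floating-point noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed stand-in for the booking data)
X, y = make_classification(n_samples=300, n_features=6, flip_y=0.1, random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Clip guards against tiny negative alphas from floating-point error
ccp_alphas = np.clip(path.ccp_alphas, 0, None)

# Fit one tree per alpha and record how many nodes survive pruning
node_counts = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y).tree_.node_count
    for a in ccp_alphas
]
print("nodes at smallest alpha:", node_counts[0])
print("nodes at largest alpha:", node_counts[-1])
```

Fitting one tree per alpha is affordable here, but on the full booking data the path has more than 1800 alphas, so the loop takes noticeably longer.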

In [146]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight="balanced"
    )
    clf.fit(X_train, y_train)  # fit the decision tree on the training data
    clfs.append(clf)
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.08117914389136943
In [147]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
[Figure: number of nodes versus alpha (top) and depth of tree versus alpha (bottom)]

F1 Score vs alpha for training and testing sets¶

In [148]:
f1_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = f1_score(y_train, pred_train)
    f1_train.append(values_train)

f1_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = f1_score(y_test, pred_test)
    f1_test.append(values_test)
In [149]:
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("F1 Score")
ax.set_title("F1 Score vs alpha for training and testing sets")
ax.plot(ccp_alphas, f1_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, f1_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
[Figure: F1 score versus alpha for the training and testing sets]
In [150]:
index_best_model = np.argmax(f1_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=np.float64(0.0001226763315516701),
                       class_weight='balanced', random_state=1)
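Note that the alpha above is chosen by maximizing F1 on the test set, so the test scores reported for the post-pruned tree are likely somewhat optimistic. One common alternative, shown here as a sketch rather than as the notebook's method, is to choose ccp_alpha by cross-validation on the training data only. The synthetic data and the thinning of the candidate alphas are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumed stand-in for the booking data)
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

base = DecisionTreeClassifier(random_state=1, class_weight="balanced")
path = base.cost_complexity_pruning_path(X_tr, y_tr)

# Drop the last alpha (single-node tree), clip float noise, keep about 20 candidates
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))
step = max(1, len(alphas) // 20)
candidates = alphas[::step]

# Cross-validate on the training data only; the test set stays untouched
search = GridSearchCV(base, {"ccp_alpha": list(candidates)}, scoring="f1", cv=5)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["ccp_alpha"]
print("chosen alpha:", best_alpha)
print("held-out accuracy:", search.best_estimator_.score(X_te, y_te))
```

The held-out score is then an honest estimate, because the test set played no part in choosing alpha.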

Checking performance on training set¶

In [151]:
confusion_matrix_sklearn(best_model, X_train, y_train)
[Figure: confusion matrix for the post-pruned decision tree on the training set]
In [152]:
decision_tree_post_perf_train = model_performance_classification_sklearn(
    best_model, X_train, y_train
)
decision_tree_post_perf_train
Out[152]:
Accuracy Recall Precision F1
0 0.90005 0.90350 0.81361 0.85620

Checking performance on test set¶

In [153]:
# create confusion matrix for test data on best model
confusion_matrix_sklearn(best_model, X_test, y_test)
[Figure: confusion matrix for the post-pruned decision tree on the test set]
In [154]:
# check performance of test set on best model
decision_tree_post_perf_test = model_performance_classification_sklearn(
    best_model, X_test, y_test
)
decision_tree_post_perf_test
Out[154]:
Accuracy Recall Precision F1
0 0.86869 0.85576 0.76595 0.80837
In [155]:
plt.figure(figsize=(20, 10))

out = tree.plot_tree(
    best_model,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
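[Figure: plot of the post-pruned decision tree]
  • The post-pruned tree is considerably deeper than the pre-pruned tree, so its decision rules do not match the ones observed earlier, although both trees begin with the same splits (lead_time <= 151.50, then no_of_special_requests <= 0.50).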
In [156]:
# Text report showing the rules of a decision tree -

print(tree.export_text(best_model, feature_names=feature_names, show_weights=True))
|--- lead_time <= 151.50
|   |--- no_of_special_requests <= 0.50
|   |   |--- market_segment_type_Online <= 0.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 196.50
|   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |--- lead_time <= 16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 68.50
|   |   |   |   |   |   |   |   |   |--- weights: [207.26, 10.63] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  68.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 29.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |--- arrival_date >  29.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 7.59] class: 1
|   |   |   |   |   |   |   |--- lead_time >  16.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 135.00
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- repeated_guest <= 0.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- repeated_guest >  0.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  135.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |--- weights: [1199.59, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  196.50
|   |   |   |   |   |   |--- weights: [0.75, 25.81] class: 1
|   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 63.29
|   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [41.75, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 59.75
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  23.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [14.91, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  59.75
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 59.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  44.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  63.29
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 3.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  59.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [20.13, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |--- weights: [413.04, 27.33] class: 0
|   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |--- avg_price_per_room <= 99.98
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 62.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  62.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 81.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  81.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [55.17, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- lead_time >  73.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [21.62, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  99.98
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 132.43
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 122.97] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  132.43
|   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- lead_time <= 117.50
|   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |--- avg_price_per_room <= 75.07
|   |   |   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 58.75
|   |   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  58.75
|   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- no_of_previous_bookings_not_canceled >  1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.96, 9.11] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  75.07
|   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [59.64, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 4.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 16.70] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  4.50
|   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  86.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [8.95, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [44.73, 4.55] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |--- arrival_date <= 11.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [16.40, 39.47] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |--- weights: [20.13, 6.07] class: 0
|   |   |   |   |   |   |--- arrival_date >  11.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 102.09
|   |   |   |   |   |   |   |   |--- weights: [5.22, 144.22] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  102.09
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 109.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 16.70] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [33.55, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  109.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.98, 75.91] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  124.25
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 3.04] class: 0
|   |   |   |   |--- lead_time >  117.50
|   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |--- weights: [38.02, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 93.58
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 65.38
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  65.38
|   |   |   |   |   |   |   |   |   |--- weights: [24.60, 3.04] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  93.58
|   |   |   |   |   |   |   |   |--- arrival_date <= 28.00
|   |   |   |   |   |   |   |   |   |--- weights: [14.91, 72.87] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  28.00
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 1.52] class: 0
|   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |--- weights: [84.25, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |--- lead_time <= 125.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.85
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [13.42, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  87.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 15.18] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.85
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  125.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- weights: [58.15, 18.22] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- weights: [61.88, 1.52] class: 0
|   |   |--- market_segment_type_Online >  0.50
|   |   |   |--- lead_time <= 13.50
|   |   |   |   |--- avg_price_per_room <= 99.44
|   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |--- weights: [92.45, 0.00] class: 0
|   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 70.05
|   |   |   |   |   |   |   |   |   |--- weights: [31.31, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  70.05
|   |   |   |   |   |   |   |   |   |--- lead_time <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- lead_time >  5.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [34.30, 40.99] class: 1
|   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- no_of_adults >  1.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  74.21
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- lead_time >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 10.63] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [155.07, 6.07] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 10.63] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  99.44
|   |   |   |   |   |--- lead_time <= 3.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.67
|   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- weights: [63.37, 30.36] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 20.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [115.56, 12.14] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  20.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [28.33, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.67
|   |   |   |   |   |   |   |--- weights: [0.75, 22.77] class: 1
|   |   |   |   |   |--- lead_time >  3.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 119.25
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 118.50
|   |   |   |   |   |   |   |   |   |--- weights: [18.64, 59.21] class: 1
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  118.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 1.52] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  119.25
|   |   |   |   |   |   |   |   |--- weights: [34.30, 171.55] class: 1
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- weights: [26.09, 1.52] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 14.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 36.43] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  14.00
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 208.67
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  208.67
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [15.66, 0.00] class: 0
|   |   |   |--- lead_time >  13.50
|   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |--- avg_price_per_room <= 71.92
|   |   |   |   |   |   |--- avg_price_per_room <= 59.43
|   |   |   |   |   |   |   |--- lead_time <= 84.50
|   |   |   |   |   |   |   |   |--- weights: [50.70, 7.59] class: 0
|   |   |   |   |   |   |   |--- lead_time >  84.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  131.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [10.44, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  59.43
|   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |--- weights: [20.88, 6.07] class: 0
|   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.34
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 68.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [15.66, 78.94] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  68.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- arrival_month >  3.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 102.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  102.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.34
|   |   |   |   |   |   |   |   |   |--- weights: [11.18, 0.00] class: 0
|   |   |   |   |   |--- avg_price_per_room >  71.92
|   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |--- lead_time <= 65.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 120.45
|   |   |   |   |   |   |   |   |   |--- weights: [79.77, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  120.45
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 12.14] class: 1
|   |   |   |   |   |   |   |--- lead_time >  65.50
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 <= 0.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 47.06] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- type_of_meal_plan_Meal Plan 2 >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 63.76] class: 1
|   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 104.31
|   |   |   |   |   |   |   |   |--- lead_time <= 25.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [16.40, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [38.77, 118.41] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  25.50
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [39.51, 185.21] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [73.81, 411.41] class: 1
|   |   |   |   |   |   |   |--- avg_price_per_room >  104.31
|   |   |   |   |   |   |   |   |--- arrival_month <= 10.50
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 195.30
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 9
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  195.30
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 138.15] class: 1
|   |   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 5 >  0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_date <= 22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_date >  22.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  10.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 168.06
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 22.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  22.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [17.15, 83.50] class: 1
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  168.06
|   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 6.07] class: 0
|   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |--- weights: [48.46, 1.52] class: 0
|   |--- no_of_special_requests >  0.50
|   |   |--- no_of_special_requests <= 1.50
|   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |--- lead_time <= 102.50
|   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |--- weights: [697.09, 9.11] class: 0
|   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |--- lead_time <= 63.00
|   |   |   |   |   |   |   |--- weights: [15.66, 1.52] class: 0
|   |   |   |   |   |   |--- lead_time >  63.00
|   |   |   |   |   |   |   |--- weights: [0.00, 7.59] class: 1
|   |   |   |   |--- lead_time >  102.50
|   |   |   |   |   |--- no_of_week_nights <= 2.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [31.31, 13.66] class: 0
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- weights: [0.75, 6.07] class: 1
|   |   |   |   |   |--- no_of_week_nights >  2.50
|   |   |   |   |   |   |--- weights: [44.73, 3.04] class: 0
|   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |--- lead_time <= 8.50
|   |   |   |   |   |--- lead_time <= 4.50
|   |   |   |   |   |   |--- no_of_week_nights <= 10.00
|   |   |   |   |   |   |   |--- weights: [498.03, 40.99] class: 0
|   |   |   |   |   |   |--- no_of_week_nights >  10.00
|   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |--- lead_time >  4.50
|   |   |   |   |   |   |--- arrival_date <= 13.50
|   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |--- weights: [58.90, 36.43] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |--- weights: [33.55, 1.52] class: 0
|   |   |   |   |   |   |--- arrival_date >  13.50
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [123.76, 9.11] class: 0
|   |   |   |   |   |   |   |--- type_of_meal_plan_Not Selected >  0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 126.33
|   |   |   |   |   |   |   |   |   |--- weights: [32.80, 3.04] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  126.33
|   |   |   |   |   |   |   |   |   |--- weights: [9.69, 13.66] class: 1
|   |   |   |   |--- lead_time >  8.50
|   |   |   |   |   |--- required_car_parking_space <= 0.50
|   |   |   |   |   |   |--- avg_price_per_room <= 118.55
|   |   |   |   |   |   |   |--- lead_time <= 61.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [70.08, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  1.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 11
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [126.74, 1.52] class: 0
|   |   |   |   |   |   |   |--- lead_time >  61.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 66.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  66.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.93
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [54.43, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.93
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 10
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |--- avg_price_per_room >  118.55
|   |   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 19.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 177.15
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 6
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  177.15
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 6.07] class: 1
|   |   |   |   |   |   |   |   |--- arrival_date >  19.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 121.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [18.64, 6.07] class: 0
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  121.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  55.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [11.93, 10.63] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [37.28, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 119.20
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [9.69, 28.84] class: 1
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  119.20
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 12
|   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [49.95, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  100.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.75, 18.22] class: 1
|   |   |   |   |   |--- required_car_parking_space >  0.50
|   |   |   |   |   |   |--- weights: [134.20, 1.52] class: 0
|   |   |--- no_of_special_requests >  1.50
|   |   |   |--- lead_time <= 90.50
|   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |--- weights: [1585.04, 0.00] class: 0
|   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- no_of_week_nights <= 9.50
|   |   |   |   |   |   |   |--- lead_time <= 6.50
|   |   |   |   |   |   |   |   |--- weights: [32.06, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  6.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 5.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 1.52] class: 0
|   |   |   |   |   |   |   |   |   |--- arrival_date >  5.50
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 93.09
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  93.09
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [77.54, 27.33] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |--- weights: [19.38, 0.00] class: 0
|   |   |   |   |   |   |--- no_of_week_nights >  9.50
|   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [52.19, 0.00] class: 0
|   |   |   |--- lead_time >  90.50
|   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 202.95
|   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |--- arrival_month <= 7.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.49, 9.11] class: 1
|   |   |   |   |   |   |   |   |--- arrival_month >  7.50
|   |   |   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |--- lead_time <= 150.50
|   |   |   |   |   |   |   |   |   |--- weights: [175.20, 28.84] class: 0
|   |   |   |   |   |   |   |   |--- lead_time >  150.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |--- avg_price_per_room >  202.95
|   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |--- avg_price_per_room <= 153.15
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 <= 0.50
|   |   |   |   |   |   |   |   |--- avg_price_per_room <= 71.12
|   |   |   |   |   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- avg_price_per_room >  71.12
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 90.42
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [12.67, 7.59] class: 0
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  90.42
|   |   |   |   |   |   |   |   |   |   |--- weights: [64.12, 60.72] class: 0
|   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  153.15
|   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |--- weights: [67.10, 0.00] class: 0
|--- lead_time >  151.50
|   |--- avg_price_per_room <= 100.04
|   |   |--- no_of_special_requests <= 0.50
|   |   |   |--- no_of_adults <= 1.50
|   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |--- lead_time <= 163.50
|   |   |   |   |   |   |--- arrival_month <= 5.00
|   |   |   |   |   |   |   |--- weights: [2.98, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  5.00
|   |   |   |   |   |   |   |--- weights: [0.75, 24.29] class: 1
|   |   |   |   |   |--- lead_time >  163.50
|   |   |   |   |   |   |--- lead_time <= 341.00
|   |   |   |   |   |   |   |--- lead_time <= 173.00
|   |   |   |   |   |   |   |   |--- arrival_date <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [46.97, 9.11] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  3.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 13.66] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |--- lead_time >  173.00
|   |   |   |   |   |   |   |   |--- arrival_month <= 5.50
|   |   |   |   |   |   |   |   |   |--- arrival_date <= 7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |--- arrival_date >  7.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [6.71, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_month >  5.50
|   |   |   |   |   |   |   |   |   |--- weights: [188.62, 7.59] class: 0
|   |   |   |   |   |   |--- lead_time >  341.00
|   |   |   |   |   |   |   |--- weights: [13.42, 27.33] class: 1
|   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |--- avg_price_per_room <= 2.50
|   |   |   |   |   |   |--- lead_time <= 285.50
|   |   |   |   |   |   |   |--- weights: [8.20, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  285.50
|   |   |   |   |   |   |   |--- weights: [0.75, 3.04] class: 1
|   |   |   |   |   |--- avg_price_per_room >  2.50
|   |   |   |   |   |   |--- weights: [0.75, 97.16] class: 1
|   |   |   |--- no_of_adults >  1.50
|   |   |   |   |--- avg_price_per_room <= 82.47
|   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |--- weights: [2.98, 282.37] class: 1
|   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |--- lead_time <= 244.00
|   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  166.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 57.69] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [17.89, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- arrival_month <= 9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 3.04] class: 0
|   |   |   |   |   |   |   |   |   |   |--- arrival_month >  9.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 12.14] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [75.30, 12.14] class: 0
|   |   |   |   |   |   |   |--- lead_time >  244.00
|   |   |   |   |   |   |   |   |--- arrival_year <= 2017.50
|   |   |   |   |   |   |   |   |   |--- weights: [25.35, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- arrival_year >  2017.50
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room <= 80.38
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [11.18, 264.15] class: 1
|   |   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- avg_price_per_room >  80.38
|   |   |   |   |   |   |   |   |   |   |--- weights: [7.46, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |--- weights: [46.22, 0.00] class: 0
|   |   |   |   |--- avg_price_per_room >  82.47
|   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |--- lead_time <= 324.50
|   |   |   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.46, 986.78] class: 1
|   |   |   |   |   |   |   |   |--- room_type_reserved_Room_Type 4 >  0.50
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 10.63] class: 1
|   |   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.47, 0.00] class: 0
|   |   |   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 19.74] class: 1
|   |   |   |   |   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |   |   |   |   |--- lead_time >  324.50
|   |   |   |   |   |   |   |--- avg_price_per_room <= 89.00
|   |   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |   |--- avg_price_per_room >  89.00
|   |   |   |   |   |   |   |   |--- weights: [0.75, 13.66] class: 1
|   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |--- weights: [5.22, 0.00] class: 0
|   |   |--- no_of_special_requests >  0.50
|   |   |   |--- no_of_weekend_nights <= 0.50
|   |   |   |   |--- lead_time <= 180.50
|   |   |   |   |   |--- lead_time <= 159.50
|   |   |   |   |   |   |--- arrival_month <= 8.50
|   |   |   |   |   |   |   |--- weights: [5.96, 0.00] class: 0
|   |   |   |   |   |   |--- arrival_month >  8.50
|   |   |   |   |   |   |   |--- weights: [1.49, 7.59] class: 1
|   |   |   |   |   |--- lead_time >  159.50
|   |   |   |   |   |   |--- arrival_date <= 1.50
|   |   |   |   |   |   |   |--- weights: [1.49, 3.04] class: 1
|   |   |   |   |   |   |--- arrival_date >  1.50
|   |   |   |   |   |   |   |--- weights: [35.79, 1.52] class: 0
|   |   |   |   |--- lead_time >  180.50
|   |   |   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |   |   |--- market_segment_type_Online <= 0.50
|   |   |   |   |   |   |   |--- no_of_adults <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [12.67, 3.04] class: 0
|   |   |   |   |   |   |   |--- no_of_adults >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.04] class: 1
|   |   |   |   |   |   |--- market_segment_type_Online >  0.50
|   |   |   |   |   |   |   |--- weights: [7.46, 206.46] class: 1
|   |   |   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |   |   |--- weights: [8.95, 0.00] class: 0
|   |   |   |--- no_of_weekend_nights >  0.50
|   |   |   |   |--- market_segment_type_Offline <= 0.50
|   |   |   |   |   |--- arrival_month <= 11.50
|   |   |   |   |   |   |--- avg_price_per_room <= 76.48
|   |   |   |   |   |   |   |--- weights: [46.97, 4.55] class: 0
|   |   |   |   |   |   |--- avg_price_per_room >  76.48
|   |   |   |   |   |   |   |--- no_of_week_nights <= 6.50
|   |   |   |   |   |   |   |   |--- arrival_date <= 27.50
|   |   |   |   |   |   |   |   |   |--- lead_time <= 233.00
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 152.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.49, 4.55] class: 1
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  152.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |--- lead_time >  233.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [23.11, 19.74] class: 0
|   |   |   |   |   |   |   |   |--- arrival_date >  27.50
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.24, 15.18] class: 1
|   |   |   |   |   |   |   |   |   |--- no_of_week_nights >  1.50
|   |   |   |   |   |   |   |   |   |   |--- lead_time <= 269.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |   |   |   |--- lead_time >  269.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.55] class: 1
|   |   |   |   |   |   |   |--- no_of_week_nights >  6.50
|   |   |   |   |   |   |   |   |--- weights: [4.47, 13.66] class: 1
|   |   |   |   |   |--- arrival_month >  11.50
|   |   |   |   |   |   |--- arrival_date <= 14.50
|   |   |   |   |   |   |   |--- weights: [8.20, 3.04] class: 0
|   |   |   |   |   |   |--- arrival_date >  14.50
|   |   |   |   |   |   |   |--- weights: [11.18, 31.88] class: 1
|   |   |   |   |--- market_segment_type_Offline >  0.50
|   |   |   |   |   |--- lead_time <= 348.50
|   |   |   |   |   |   |--- weights: [106.61, 3.04] class: 0
|   |   |   |   |   |--- lead_time >  348.50
|   |   |   |   |   |   |--- weights: [5.96, 4.55] class: 0
|   |--- avg_price_per_room >  100.04
|   |   |--- arrival_month <= 11.50
|   |   |   |--- no_of_special_requests <= 2.50
|   |   |   |   |--- weights: [0.00, 3200.19] class: 1
|   |   |   |--- no_of_special_requests >  2.50
|   |   |   |   |--- weights: [23.11, 0.00] class: 0
|   |   |--- arrival_month >  11.50
|   |   |   |--- no_of_special_requests <= 0.50
|   |   |   |   |--- weights: [35.04, 0.00] class: 0
|   |   |   |--- no_of_special_requests >  0.50
|   |   |   |   |--- arrival_date <= 24.50
|   |   |   |   |   |--- weights: [3.73, 0.00] class: 0
|   |   |   |   |--- arrival_date >  24.50
|   |   |   |   |   |--- weights: [3.73, 22.77] class: 1
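
Each leaf in the printed tree lists two weights, one per class, followed by the predicted class. The fractional values suggest the tree was fitted with class weights, so the numbers are weighted totals rather than raw booking counts. The sketch below is an illustrative helper, not part of the original notebook; `leaf_summary` is a hypothetical name, and it assumes class 1 denotes a canceled booking. It converts a leaf line into the weighted share of class 1, which makes it easier to tell near-pure leaves from borderline ones.

```python
import re

# Matches leaf lines of the printed tree, e.g. "|--- weights: [0.75, 97.16] class: 1".
LEAF_PATTERN = re.compile(r"weights: \[([\d.]+), ([\d.]+)\] class: (\d)")


def leaf_summary(line):
    """Return (predicted_class, weighted_share_of_class_1) for a leaf line.

    Returns None for lines that describe a split rather than a leaf.
    Assumption: class 1 is the "canceled" class.
    """
    match = LEAF_PATTERN.search(line)
    if match is None:
        return None
    weight_0, weight_1 = float(match.group(1)), float(match.group(2))
    share_1 = weight_1 / (weight_0 + weight_1)
    return int(match.group(3)), round(share_1, 3)


# Two leaves copied from the tree printed above.
print(leaf_summary("|   |   |   |   |--- weights: [0.00, 3200.19] class: 1"))         # (1, 1.0)
print(leaf_summary("|   |   |   |   |   |   |--- weights: [64.12, 60.72] class: 0"))  # (0, 0.486)
```

The first leaf is pure, while the second is close to an even split even though it predicts class 0, so predictions falling in that leaf deserve less confidence.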

In [157]:
importances = best_model.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
[Figure: horizontal bar chart "Feature Importances"; x-axis: Relative Importance]
  • In the post-pruned decision tree, lead time is the most important feature.
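
A bar chart shows the ranking, but a sorted table makes the exact values easier to quote. The following is a minimal sketch on synthetic data, since the notebook's `best_model` and `feature_names` are not available in isolation; `demo_model`, `demo_feature_names`, and the generated dataset are stand-ins introduced only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook would use its own training set instead.
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=1
)
demo_feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

demo_model = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importance_table = pd.Series(
    demo_model.feature_importances_, index=demo_feature_names
).sort_values(ascending=False)

print(importance_table.round(3))
```

In the notebook itself, the same pattern would presumably apply with `best_model.feature_importances_` and `feature_names` in place of the stand-ins.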

Model Performance Comparison and Conclusions¶

Comparing Decision Tree models¶

In [158]:
# training performance comparison

models_train_comp_df = pd.concat(
    [
        decision_tree_perf_train.T,
        decision_tree_tune_perf_train.T,
        decision_tree_post_perf_train.T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[158]:
           Decision Tree sklearn  Decision Tree (Pre-Pruning)  Decision Tree (Post-Pruning)
Accuracy                 0.99421                      0.83101                       0.90005
Recall                   0.98661                      0.78620                       0.90350
Precision                0.99578                      0.72428                       0.81361
F1                       0.99117                      0.75397                       0.85620
In [159]:
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        decision_tree_perf_test.T,
        decision_tree_tune_perf_test.T,
        decision_tree_post_perf_test.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[159]:
           Decision Tree sklearn  Decision Tree (Pre-Pruning)  Decision Tree (Post-Pruning)
Accuracy                 0.87108                      0.83497                       0.86869
Recall                   0.81034                      0.78336                       0.85576
Precision                0.79521                      0.72758                       0.76595
F1                       0.80270                      0.75444                       0.80837

Observations

  • Decision Tree (sklearn), the default unpruned tree, has very high training performance (Accuracy ≈ 0.99, F1 ≈ 0.99) but drops significantly on the test set (Accuracy ≈ 0.87, F1 ≈ 0.80). This suggests overfitting: the model fits the training data extremely well but generalizes less effectively.

  • Pre-pruning removes the overfitting but also lowers training performance (Accuracy ≈ 0.83, F1 ≈ 0.75). Test results are almost identical to the training results (Accuracy ≈ 0.83, F1 ≈ 0.75) but are the lowest of the three models, indicating possible underfitting.

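  • Post-pruning strikes a better balance: training performance is lower than the unpruned tree (Accuracy ≈ 0.90, F1 ≈ 0.86), but it achieves the best test F1 (≈ 0.81) and Recall (≈ 0.86). The unpruned tree keeps a marginally higher test Accuracy (0.871 vs. 0.869) and Precision (0.795 vs. 0.766), so the post-pruned tree is the better choice when catching cancellations (recall) matters most. Its smaller train-test gap shows improved generalization and reduced overfitting compared to the raw sklearn tree.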
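
The overfitting argument above rests on the gap between training and test scores. As a quick check, the sketch below recomputes that gap from the values printed in the two comparison tables; the numbers are copied by hand, so the variable names (`train_scores`, `test_scores`, `gap`) are illustrative rather than taken from the notebook.

```python
import pandas as pd

models = [
    "Decision Tree sklearn",
    "Decision Tree (Pre-Pruning)",
    "Decision Tree (Post-Pruning)",
]
metrics = ["Accuracy", "Recall", "Precision", "F1"]

# Values copied from the training performance comparison table.
train_scores = pd.DataFrame(
    [
        [0.99421, 0.83101, 0.90005],
        [0.98661, 0.78620, 0.90350],
        [0.99578, 0.72428, 0.81361],
        [0.99117, 0.75397, 0.85620],
    ],
    index=metrics,
    columns=models,
)

# Values copied from the test performance comparison table.
test_scores = pd.DataFrame(
    [
        [0.87108, 0.83497, 0.86869],
        [0.81034, 0.78336, 0.85576],
        [0.79521, 0.72758, 0.76595],
        [0.80270, 0.75444, 0.80837],
    ],
    index=metrics,
    columns=models,
)

# A large positive gap means the model scores much better on data it has seen.
gap = (train_scores - test_scores).round(5)
print(gap)
```

On F1, the gap is about 0.188 for the unpruned tree, about 0.048 for the post-pruned tree, and essentially zero for the pre-pruned tree, which matches the pattern described in the observations.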

Actionable Insights and Recommendations¶

  • What profitable policies for cancellations and refunds can the hotel adopt?
  • What other recommendations would you suggest to the hotel?

Based on the analysis and model performance, the following insights and recommendations can help INN Hotels Group formulate profitable cancellation and refund policies and make other improvements:

Profitable Policies for Cancellations and Refunds:

  1. Implement Dynamic Cancellation Fees based on Lead Time: The analysis shows that bookings with longer lead times have a higher probability of being canceled. Consider implementing a tiered cancellation fee structure where the fee increases as the arrival date gets closer. This incentivizes guests to commit or cancel earlier, allowing the hotel more time to resell the room. For bookings made very close to arrival (very short lead times), where there is little time left to resell, a higher fee or non-refundable option could be considered for certain segments.

  2. Offer Incentives for Non-Refundable Rates for High-Risk Segments: For market segments or booking characteristics identified as having a higher cancellation risk (e.g., certain market segments like 'Online', specific room types, or bookings without special requests), offer a slightly discounted non-refundable rate option. This can help secure revenue upfront for potentially volatile bookings.

  3. Review Meal Plan and Room Type Pricing and Policies: The analysis indicated that Meal Plan 2 and certain Room Types (Room Type 4 and Room Type 6) have higher cancellation rates. Investigate if the pricing or terms associated with these options contribute to cancellations. Consider adjusting pricing or offering more flexible cancellation terms for less popular or higher-risk options to attract more committed bookings.

  4. Leverage Special Requests to Reduce Cancellations: Bookings with more special requests have a significantly lower cancellation rate. Encourage guests to make special requests during the booking process. This could be done through the booking platform interface or targeted communication, as it seems to indicate a higher level of commitment to the stay.

  5. Tailor Policies for Repeated Guests: Repeated guests have a very low cancellation rate. Offer these loyal customers more flexible cancellation policies or exclusive benefits as a reward for their loyalty and lower risk profile. This strengthens brand equity and encourages continued business.

  6. Consider Deposit or Pre-payment Requirements for High-Risk Bookings: For bookings flagged by the predictive model as having a high probability of cancellation, consider requiring a partial deposit or full pre-payment, especially for longer lead times or during peak seasons.

  7. Analyze Cancellation Reasons (if data is available): While the current dataset doesn't include cancellation reasons, collecting this data would be invaluable for understanding the root causes of cancellations and developing more targeted policies and operational improvements.

Other Recommendations:

  1. Utilize the Predictive Model for Targeted Interventions: Implement the trained Decision Tree model (post-pruning) to identify bookings with a high probability of cancellation in advance. The hotel can then proactively engage with these guests through personalized communication (e.g., reminder emails, special offers to encourage commitment) to try to prevent the cancellation.

  2. Optimize Marketing Strategies based on Market Segment Analysis: Focus marketing efforts on market segments with lower cancellation rates, such as the 'Corporate' and 'Offline' segments. While the 'Online' segment brings in the most bookings, the higher cancellation rate suggests a need to either improve targeting within this segment or find ways to increase booking commitment.

  3. Analyze Arrival Month and Date Trends: The analysis showed seasonality and variations in cancellation rates by month and date. Use this information for pricing strategies, staffing levels, and targeted promotions during periods with historically higher cancellation rates.

  4. Investigate Reasons for Cancellations in High-Risk Months/Dates: Deep dive into the characteristics of bookings canceled during months and dates with particularly high cancellation rates to identify any common factors not captured by the current features.

  5. Improve the Booking Experience for High-Risk Segments: Review the online and offline booking processes for segments with high cancellation rates to ensure clarity of terms and conditions, especially regarding cancellation policies. A smoother and more transparent booking experience might reduce impulse bookings that are later canceled.

  6. Monitor and Re-evaluate Policies Regularly: Continuously monitor the effectiveness of implemented policies and the performance of the predictive model. The booking landscape is dynamic, and policies should be adjusted based on observed trends and model performance over time.
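
The recommendation to use the predictive model for targeted interventions could be put into practice roughly as follows. This is a minimal sketch on synthetic data, not the notebook's pipeline: `demo_model`, the generated dataset, and the `RISK_THRESHOLD` of 0.7 are all assumptions chosen for illustration, and class 1 is assumed to mean "canceled". In practice the threshold would be tuned against the cost of a missed cancellation versus the cost of contacting a guest unnecessarily.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for booking data; roughly one third of bookings "cancel".
X, y = make_classification(
    n_samples=1000, n_features=8, weights=[0.67, 0.33], random_state=1
)
X_train, X_new, y_train, _ = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Stand-in for the tuned tree; the hyperparameters here are illustrative only.
demo_model = DecisionTreeClassifier(
    max_depth=5, class_weight="balanced", random_state=1
)
demo_model.fit(X_train, y_train)

RISK_THRESHOLD = 0.7  # hypothetical cut-off, to be tuned with the business

# Column 1 of predict_proba is the estimated probability of class 1 (canceled).
cancel_probability = demo_model.predict_proba(X_new)[:, 1]

bookings = pd.DataFrame(
    {
        "booking_index": range(len(X_new)),
        "cancel_probability": cancel_probability,
    }
)
bookings["needs_follow_up"] = bookings["cancel_probability"] >= RISK_THRESHOLD

# Highest-risk bookings first, so staff can prioritize outreach.
flagged = bookings[bookings["needs_follow_up"]].sort_values(
    "cancel_probability", ascending=False
)
print(f"{len(flagged)} of {len(bookings)} bookings flagged for follow-up")
print(flagged.head())
```

The same flag could also drive the deposit or pre-payment requirement suggested earlier, applied only to bookings above the chosen threshold.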

By implementing these data-driven policies and recommendations, INN Hotels Group can aim to reduce booking cancellations, minimize revenue loss, and optimize its operational efficiency.